Abstract

Background: Sinorhizobium meliloti is a soil-dwelling α-proteobacterium that possesses a large, tripartite genome
and engages in a nitrogen fixing symbiosis with its plant hosts. Although much is known about this important 
model organism, global characterization of genetic regulatory circuits has been hampered by a lack of information 
about transcription and promoters. 

Results: Using an RNAseq approach and RNA populations representing 16 different growth and stress conditions, 
we comprehensively mapped S. meliloti transcription start sites (TSS). Our work identified 17,001 TSS that we
grouped into six categories based on the genomic context of their transcripts: mRNA (4,430 TSS assigned to 2,657 
protein-coding genes), leaderless mRNAs (171), putative mRNAs (425), internal sense transcripts (7,650), antisense 
RNA (3,720), and trans-encoded sRNAs (605). We used this TSS information to identify transcription factor binding 
sites and putative promoter sequences recognized by seven of the 15 known S. meliloti σ factors (σ70, σ54, σH1, σH2,
σE1, σE2, and σE9). Altogether, we predicted 2,770 new promoter sequences, including 1,302 located upstream of
protein coding genes and 722 located upstream of antisense RNA or trans-encoded sRNA genes. To validate
promoter predictions for targets of the general stress response σ factor, RpoE2 (σE2), we identified rpoE2-dependent
genes using microarrays and confirmed TSS for a subset of these by 5' RACE mapping. 

Conclusions: By identifying TSS and promoters on a global scale, our work provides a firm foundation for the 
continued study of S. meliloti gene expression with relation to gene organization, σ factors and other transcription
factors, and regulatory RNAs. 

Keywords: Transcription, RNAseq, Transcription start site, Promoter, Sigma factor, Sinorhizobium meliloti, mRNA, 
sRNA, Antisense RNA 



Background 

Transcription is the first committed step of gene expression 
in prokaryotes, and as such is highly regulated. Promoter 
sequences direct the transcription of both coding and non- 
coding RNAs by acting as target sites for specific RNA 
polymerase binding and activity [1-3]. Bacteria employ an 
ingenious machinery to adapt effectively and economically 
to conditions of stress or environmental changes. The ini- 
tial mechanism of transcription regulation is based on 
recruitment of certain sigma (σ) factors by RNA polymerase
core enzyme (RNAP). Complex formation with σ factors
is essential for RNAP binding to a particular promoter
sequence and thus for transcription initiation [3]. 

Genome-wide identification of promoter sequences facil- 
itates identification of DNA-binding sites for regulatory 
proteins and also provides insights into the organization of 
transcriptional units. RNAseq approaches have allowed the 
large-scale identification of transcription start sites (TSS) in 
the ε-proteobacterium Helicobacter pylori and the cyanobacteria
Synechocystis sp., Synechococcus elongatus, and
Anabaena sp., facilitating the characterization of promoter 
sequence motifs upstream of the TSS. These approaches 
revealed an unexpected abundance of cis-encoded antisense
RNAs (asRNA) and trans-encoded sRNAs (sRNA), and 
thereby illuminated a previously unknown dimension of 
prokaryotic transcriptional activity [4-7]. This picture is 
consistent with RNAseq studies in diverse Gram-positive 
and Gram-negative bacteria [8-17]. 

The α-class of the Proteobacteria comprises diverse
bacteria with complex lifestyles, including obligate and 
facultative plant- and animal-associated bacteria (engaging 
in both mutualistic and pathogenic interactions), 
phototrophs, chemoorganotrophs, and chemolithotrophs 
[18]. In addition, eukaryotic mitochondria are proposed to 
have arisen from an ancient α-proteobacterium [18]. Despite
their importance, little is known about
α-proteobacterial transcriptome organization. This study
presents the first comprehensive mapping of TSS and assignment
of identified promoter sequences to σ factors in
an α-proteobacterium. Sinorhizobium meliloti exists either
in a free-living life-style in the soil or in symbiosis with a 
leguminous plant host. In the symbiotic relationship, the 
bacteria inhabit root nodules, differentiate into polyploid 
bacteroids, and fix nitrogen to the benefit of the host [19]. 
Functions for these distinct lifestyles are encoded in the 
tripartite S. meliloti genome: a single chromosome (3.65
Mbp) and two megaplasmids, pSymA (1.35 Mbp) and 
pSymB (1.68 Mbp) [20-22]. 

To adapt to environmental changes or stress situations, 
S. meliloti can draw on a set of 15 σ factors [20,23]. RpoD
(σ70) provides for housekeeping functions, while alternative
σ factors are usually involved in adaptation to specific
stresses or growth conditions [24]. RpoN (σ54) is essential
for transcription of nitrogen fixation-related genes [25].
Two RpoH σ factors, with sequence similarity to the
Escherichia coli heat shock σ32, were identified in S.
meliloti [26]; RpoH1 (σH1) was found to be largely responsible
for adaptation to heat shock, oxidative stress, and pH
changes, whereas the role of RpoH2 (σH2) is largely unknown
[26]. At least 11 σ factor genes (rpoE1-rpoE10, fecI)
are annotated in the S. meliloti genome as encoding
extracytoplasmic function (ECF) σ factors, which are usually
regulated by anti-σ factors. The ECF σ factor, RpoE2
(σE2), was characterized as the regulator of at least 44
genes, including rpoH2 and rpoE5, and was inferred to be
a global regulator of general stress adaptation and the
hyperosmotic stress response [27,28].

The RNAseq approach reported in this study obtained 
17,001 experimentally mapped TSS, including both 
protein-coding and non-coding transcripts. We were able 
to predict 2,847 σ factor-specific promoter sequences at
an appropriate distance from a TSS. This newly defined landscape
of TSS and promoter motifs increases our knowledge
of promoter properties and will facilitate further
analyses of transcriptional and post-transcriptional regula- 
tion processes in S. meliloti. 



Results and discussion 

Global mapping of transcription start sites 

To comprehensively map TSS of the S. meliloti 1021 
transcriptome, RNA populations derived from 16 differ- 
ent growth and stress conditions were pooled and used 
for cDNA library preparation (Methods). These condi- 
tions included exponential and stationary phase growth 
in three different media, temperature and pH shifts, oxi- 
dative and high salt stress, microoxia, carbon or nitrogen 
starvation, and exposure to the plant-secreted flavonoid, 
luteolin. This flavonoid induces expression of bacterial 
nodulation genes, which are required for the establish- 
ment of symbiosis [29]. S. meliloti 1021 carries a muta- 
tion in expR encoding a global LuxR-type transcriptional 
regulator engaged in quorum sensing. This regulator af- 
fects multiple processes including exopolysaccharide 
biosynthesis and motility [30-32]. In order to map ExpR- 
dependent transcripts, which would otherwise be lacking,
RNA from an expR+ derivative of S. meliloti 1021
was also included in our pool. Enzymatic treatment of 
RNA samples with Terminator 5'-phosphate-dependent
exonuclease resulted in enrichment of transcripts with 
primary 5' ends, prior to cDNA synthesis and deep se- 
quencing (Methods). 

Our procedure for mapping and classifying TSS is diagrammed
in Additional file 1: Figure S1 and described in
detail in Methods. Briefly, 37,792,608 reads were obtained, 
of which 11,230,166 passed all filtering criteria, and subse- 
quently were subjected to TSS determination and classifi- 
cation. TSS assigned to tRNAs, repeats, and transposons 
are listed separately (Additional file 2: Table S1).

Overall, 17,001 TSS, not including those listed in Table
S1, were grouped into six categories based on their genomic
context with respect to a minimal transcription
unit (MTU) model (Methods; Figure 1a; Additional file
2: Table S2). (i) TSS representing the prominent 5' end of a
sequence contig that overlaps, in sense orientation, a region
of 54 nt upstream of the start codon of a protein-coding
gene were classified as TSS of mRNAs (mTSS). The 54-nt
region was defined as the minimal region upstream of
the ATG needed to cover promoter motifs and the
ribosome binding site, which are more likely to be associated
with an mRNA than with a trans-encoded sRNA.
(ii) TSS were assigned to leaderless transcripts (lmTSS)
if they matched the first nucleotide of the translation
initiation codon. (iii) The class of putative TSS of mRNAs
(pmTSS) comprises TSS that are difficult to distinguish:
for any given pmTSS, it is uncertain whether the TSS represents
an mRNA of a protein-coding gene possessing a long
5'-UTR or a trans-encoded sRNA. (iv) Sense TSS
(seTSS) represent internal transcripts in the same orientation
as, and located within, protein-coding genes. (v)
TSS of cis-encoded antisense RNAs (asTSS) are orientated
antisense to protein-coding target genes. (vi)
TSS of trans-encoded sRNAs (sTSS) are located in
intergenic regions (IGR) and within a defined distance
from neighboring genes. Figures 1b and 1c give an overview
of the location and number of TSS in each category,
which are discussed in more detail below.

Figure 1 Identification of transcription start sites (TSS) in the S. meliloti genome. (a) Schematic showing parameters of TSS types based on
the minimal transcription unit model described in Methods. (i) TSS of an mRNA. (ii) TSS of a leaderless transcript. (iii) TSS of a putative mRNA.
(iv) TSS of a sense transcript. (v) TSS of a cis-encoded antisense RNA. (vi) TSS of a trans-encoded sRNA. (b) Strand and replicon specific
distribution of TSS by category. The y-axis indicates log10 coverage for mapped reads. (c) Number of protein-coding genes, asRNAs and sRNAs
possessing 1, 2, 3, 4 or >4 TSS. The last two rows give the number of genes or IGRs possessing 1, 2, 3, 4 or >4 asRNAs or sRNAs. (d) TSS
associated with pstS, ctrA, and SMc02853 (indicated by blue triangles). Putative transcription factor binding motifs are shown as orange
rectangles. (e) TSS associated with SMb_asRNA_3039 and SMa_sRNA_136 (indicated by blue triangles).
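The six-category scheme can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the classification pipeline described in Methods: only one neighboring gene is considered, the names `Gene` and `classify_tss` are hypothetical, and the `pm_max_dist` threshold that separates pmTSS from sTSS is an invented placeholder, because the text only says "a defined distance". The 54-nt window is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    start: int   # leftmost coordinate of the ORF (1-based)
    end: int     # rightmost coordinate of the ORF
    strand: str  # "+" or "-"


def classify_tss(tss, contig_end, tss_strand, gene,
                 utr_window=54, pm_max_dist=300):
    """Assign a TSS to one of six categories relative to a single gene.

    tss / contig_end are the 5' end and the far end of the read contig.
    pm_max_dist is a hypothetical cutoff, not a value from the paper.
    """
    length = gene.end - gene.start + 1

    def rel(pos):
        # Gene-relative coordinate: 0 = first nt of the start codon,
        # negative = upstream, positive = inside or downstream.
        return pos - gene.start if gene.strand == "+" else gene.end - pos

    rel_tss = rel(tss)
    inside = 0 <= rel_tss < length

    if tss_strand != gene.strand:
        # Simplification: only TSS inside the ORF count as antisense here.
        return "asTSS" if inside else "sTSS"
    if rel_tss == 0:
        return "lmTSS"            # starts on the first nt of the start codon
    if inside:
        return "seTSS"            # internal sense transcript
    if rel_tss >= length:
        return "sTSS"             # downstream of the gene, intergenic
    if rel(contig_end) >= -utr_window:
        return "mTSS"             # contig reaches the 54-nt upstream window
    if -rel_tss <= pm_max_dist:
        return "pmTSS"            # ambiguous: long 5'-UTR or sRNA
    return "sTSS"
```

For example, with a plus-strand gene spanning positions 1000 to 1900, a sense TSS at 932 whose contig extends to 1010 would be labeled `mTSS`, whereas a TSS at exactly 1000 would be labeled `lmTSS`.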



TSS categories 

mTSS (TSS associated with mRNAs) 

A total of 4,430 mTSS were assigned to 2,657 protein- 
coding genes (Figures 1b and 1c; Additional file 2: Table
S2). The total number of S. meliloti annotated protein- 



Schluter et al. BMC Genomics 2013, 14:156 
http://www.biomedcentral.com/1471-2164/14/156






coding genes is 6296 [33], and 1090 operons were pre- 
dicted [34]. Assuming a rough estimate of three co- 
transcribed genes per operon, we have identified TSS for 
most protein-coding genes in the S. meliloti genome. 
About 60% (1,644 genes) appear to contain a single TSS, 
while the remaining genes contain multiple TSS. At least 
12 putative TSS were identified for each of the genes 
SMc01904, SMc02122, and SMc02396 (Additional file 2: 
Table S2). The most frequent initiating nucleotide for 
mTSS transcripts was A (47.9%), followed by G (23.6%), 
T (14.8%), and C (13.7%) (Figure 2a). This agrees with ob- 
servations that purine nucleotides are the most common 
initiating nucleotide in bacteria [15,35], a preference that 
may be explained by the relatively larger pool size of pur- 
ine vs. pyrimidine nucleotides [36]. 

The existence of alternative promoters for a single 
gene is one explanation that could account for multiple 
TSS, as exemplified by pstS, which encodes a phosphate- 



binding periplasmic protein, and was previously shown 
to have two TSS located downstream from two PhoB 
binding motifs (B1 and B2) corresponding to TSS S8 and
S9 of Figure 1d [37]. Similarly, ctrA, which encodes a global
cell cycle regulator, showed two strongly accumulating
transcripts corresponding to TSS S2 and S6 (Figure 1d)
that were previously identified downstream from conserved
promoter motifs (Figure 1d) [38]. Alternatively,
post-transcriptional processing of the 5'-UTR of mRNAs
may account for multiple mTSS located downstream of
TSS associated with conserved promoter motifs, as is
plausible for six TSS of pstS (S1, S3, S4, S5, S8, and S9 in
Figure 1d) and four TSS of ctrA (S1, S3, S4, and S5
in Figure 1d).

The average mTSS-to-start-codon distance was ~68 nt
(Figure 2b), somewhat longer than previously reported for
Salmonella enterica serovar Typhimurium [15] and H. pylori
[4]. Moreover, the average distance increases to ~80 nt
if pmTSS lengths (see below) are included in the calculation.
In 40% of all cases, the distance between the mTSS
and start codon is between 40 and 100 nt. The maximum
distance was 309 nt for the mRNA leader transcript of
SMc02724 (Additional file 2: Table S2; [17]). 5'-UTRs longer
than 100 nt were found preceding 1,041 genes (1,466
if pmTSS are also included). This indicates that long
5'-UTRs are not unusual in S. meliloti and may even signify
complex mechanisms of gene regulation, as is likely the
case for ctrA (Figure 1d; [38]).

Figure 2 Characteristics of S. meliloti transcripts. (a) Nucleotide composition for the first 3 nt of mRNAs. Positions +1, +2, and +3 are
indicated, of which +1 is the TSS. (b) Distance of TSS to start codons. In orange: TSS of leaderless transcripts (lmTSS); purple: TSS located close to
the start codon that were classified as putative TSS of sense sRNAs (seTSS); green: TSS of mRNAs, including cis-encoded mRNA leader sequences
of [17]; blue: TSS of putative mRNAs (pmTSS). (c) (ia) Transcription units identified by both the TSS-specific RNAseq and the small RNA-specific
RNAseq approach (applying stringent [17] and relaxed thresholds). (ib) mRNA leader sequence (SmelCR345), (ii) sRNA (SmelBR028), (iii) leaderless
transcript (SmelAR015) and (iv) asRNA (SmelBR068). TSS are indicated by blue triangles.

To identify new protein-coding genes that were missed in 
previous genome annotations [20,33], we followed the pro- 
cedure described in Methods and outlined in Additional file 
1: Figure S2. This included screening for translational sig- 
nals, such as ribosome binding sites (RBS) and translational 
start and stop codons. This approach yielded 221 TSS, asso- 
ciated with 71 new and 150 alternative translational units 
(Additional file 2: Table S3). To increase the probability of 
annotating only genuine open reading frames (ORFs), we 
searched the NCBI nonredundant protein database for 
orthologs of these newly identified genes. Genes that had 
orthologs in other α-proteobacteria were named according
to the scheme, SMx_ORF_x, and added to the GenDB S. 
meliloti genome database [39]. 

lmTSS (TSS associated with leaderless transcripts)

171 TSS of leaderless transcripts mapped to the first nucleotide
of the corresponding start codon (Figures 1a, 1b,
and 2b). Only a single TSS was found for 112 of these
lmTSS, with the remainder found in combination with 
mTSS (n=52), pmTSS (n=4), or both (n=3) (Additional 
file 2: Table S2). Leaderless transcripts were also observed 
in Synechocystis sp. PCC6803, E. coli, Rhodobacter 
sphaeroides, and Pseudomonas putida [40-43]. All lmTSS 
initiate with an AUG triplet, consistent with results in 
E. coli showing that the 5'-terminal AUG of leaderless
mRNAs is important for ribosome recognition and interaction
[44], and that non-AUG start codons are inefficient
[45]. Also, a recent study proposed a mechanism for leaderless
mRNA translation in E. coli [40]. In this mechanism,
the mazEF stress-induced toxin-antitoxin system, in par- 
ticular the endoribonuclease MazF, plays an important 
role in maturation of both leaderless mRNAs and a novel, 
processed 16S rRNA. Assembly of the ribosome with this 
novel rRNA molecule results in an alternative translational 
machinery, which is able to translate two classes of leader- 
less transcripts: those processed via MazF and naturally 
occurring leaderless mRNAs [40]. In S. meliloti, our obser- 
vation that most genes with leaderless transcripts possess 
a single TSS (n=112), suggests that naturally occurring 
leaderless mRNAs are preferred over processed leaderless 
transcripts. It is possible that the 59 leaderless mRNAs for 
which an alternative TSS was identified were processed via 



endoribonucleases similar to MazF in E. coli, but such a 
mechanism has not yet been described in S. meliloti.

Organization of protein-coding genes into operons is a 
common feature in bacterial genomes. Recently, we 
reported 67 experimentally validated operons and 1,090 
predicted operons in the S. meliloti 1021 genome [34]. TSS 
for 47 of those experimentally validated operons and 613 of 
the predicted operons were identified in our RNAseq data 
(Additional file 2: Table S4). 

pmTSS (TSS associated with putative mRNAs) 

A special situation is illustrated by 425 pmTSS, of which 
122 represent the only TSS identified for the corresponding
gene (Figure 1b; Additional file 2: Table S2). In many
cases, it is unclear if these pmTSS represent mRNAs with
unusually long 5'-UTRs or non-coding transcripts. One
such example is illustrated by SMc02853, which possesses
two pmTSS (Figure 1d). On the other hand, the most distal
TSS of ctrA (S6 in Figure 1d) almost certainly represents
a transcript with a long 5'-UTR (291 nt), because
CtrA was shown to bind upstream of this TSS [38]. CtrA
also binds to four additional sites downstream of the S6
TSS, suggesting a complex mechanism of autoregulation
(Figure 1d; [38]).

seTSS (TSS associated with internal sense transcripts of 
ORFs) 

Sense TSS must meet two criteria: they must be located in 
the same orientation as a protein-coding gene and within 
the corresponding ORF. Sense transcripts are the largest 
group of TSS in our study: 7,650 were classified as belong- 
ing to sense transcripts; and therefore, approximately half 
of all identified TSS correspond to internal fragments of 
mRNA transcripts (Figure 1c; Additional file 2: Table S2).

A high proportion of sense transcripts was also observed 
in RNAseq studies of other prokaryotes, e.g. Synechocystis 
sp. PCC6803 and Anabaena sp. PCC7120, but the func- 
tional role of these RNA fragments, if any, remains to be 
elucidated [5,6]. It is possible that sense transcripts may act 
as target mimicry molecules that sequester sRNAs, 
asRNAs, or ribonucleases from their respective mRNA tar- 
get domains [5]. Presumably, a small proportion of sense 
fragments may represent alternative mRNAs that allow for 
synthesis of shorter protein isoforms in S. meliloti. However,
a more likely explanation is that despite the procedure
we used to enrich for transcripts with primary 5' ends, the 
majority of sense fragments correspond to abundant 
processed derivatives of their respective mRNAs rather 
than independent, primary mRNAs transcribed from their 
own promoters [17]. 

asTSS (TSS of cis-encoded antisense sRNAs) 

Cis-encoded antisense RNAs act as regulatory RNAs via perfect
base pairing with their corresponding target mRNAs encoded
on the opposite DNA strand. The first genome-wide
RNAseq-based screen for non-coding transcripts in S.
meliloti identified 117 asRNAs [17]. However, antisense
RNA detection was limited in that study, because the protocol
was designed to preferentially detect short RNAs. This
study greatly increased the number of antisense transcripts
with the identification of 3,720 asTSS that were assigned to
2,957 asRNAs associated with the noncoding strand of
protein-coding genes (Figures 1b and 1c; Additional file 2:
Table S2). Approximately 35% of the protein-coding genes
in S. meliloti (2,210 out of 6,296 genes) had antisense transcripts
(Figure 1c; Additional file 2: Table S2). For 1,648 target
genes, a single asTSS was identified, while the remaining
genes appear to contain two or more asTSS (Figure 1c). An
example of an antisense RNA with three TSS,
SMb_asRNA_3039, is shown in Figure 1e. Occurrence of
numerous antisense transcripts associated with a single
protein-coding gene most likely represents processed
products of a single, primary non-coding asRNA, rather
than transcripts of independent asRNA genes. Examples
are SMb21548, SMc02498, and SMc03761 with 50, 14,
and 12 asTSS, respectively (Additional file 2: Table S2).
The three replicons, chromosome, pSymA, and pSymB, represent
54.6%, 20.2%, and 25.2% of the total genome size in
bp, respectively. Taking into account the different replicon
sizes, antisense transcripts were weakly overrepresented on pSymA
(23.9%) and pSymB (26.5%) (Figure 1b; Additional file 2:
Table S2). The strand-specific location of antisense transcripts
on pSymB shows a slight preference for the minus
strand, whereas it is virtually equal on the remaining
replicons (Figure 1b).
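The replicon comparison above amounts to dividing each replicon's share of antisense transcripts by its share of the genome. A minimal sketch follows; the percentages for pSymA and pSymB are those quoted in the text, while the chromosome share of antisense transcripts is not stated and is derived here by subtraction, which is an assumption. The function name `enrichment` is hypothetical.

```python
# Share of total genome size (percent), as quoted in the text.
genome_share = {"chromosome": 54.6, "pSymA": 20.2, "pSymB": 25.2}

# Share of antisense transcripts (percent). Only the megaplasmid values
# are quoted; the chromosome value is inferred as the remainder.
antisense_share = {"pSymA": 23.9, "pSymB": 26.5}
antisense_share["chromosome"] = round(100 - 23.9 - 26.5, 1)


def enrichment(observed, expected):
    """Ratio of observed to expected share; >1 means over-represented."""
    return {name: observed[name] / expected[name] for name in expected}


ratios = enrichment(antisense_share, genome_share)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Ratios only slightly above 1 for both megaplasmids are consistent with the "weakly overrepresented" wording.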

The observed features for cis-encoded antisense sRNAs
in S. meliloti are in good agreement with other bacterial
transcriptomes. Three studies detected antisense transcription
of ~27% to ~50% of the annotated protein-coding genes
in two cyanobacteria and H. pylori [4-6]. It is tempting to 
speculate that antisense transcriptional activity is widespread 
in prokaryotic genomes. However, antisense transcriptional 
activity represented less than 5% of the protein-coding genes 
in Chlamydia trachomatis, Staphylococcus aureus, P. 
syringae, Vibrio cholerae, and S. enterica [12,15,46-48]. It is 
unknown whether this variation reflects differences in ex- 
perimental and data analysis procedures or biological 
mechanisms. 

The total length of each transcript in our study is un- 
known due to the method we used for cDNA library prep- 
aration. Prokaryotic asRNAs are generally short 
transcripts of approximately 100 to 300 nt [2]. This number
is in good agreement with the size range (59 to 258 nt)
of 117 S. meliloti asRNAs identified in Schluter et al. [17]. 
However, several asRNAs of other bacterial species were 
as long as 7,000 nt [49-53]. 

Depending on the location of an asRNA relative to its 
target gene, various mechanisms have been proposed. In 



S. meliloti, 441 predicted asRNAs overlapped the 5'-UTR 
of the associated mRNA (Additional file 2: Table S2). 
These may function as described for SymR in E. coli and 
RnaG in Shigella flexneri: cis-encoded asRNAs, antisense
to 5'-UTR domains of their relative target genes, are able 
to modulate transcription, translation efficiency, and 
mRNA stability [54,55]. In S. meliloti, 250 asRNAs overlap 
the 3'-UTR of their particular target genes (Additional file 
2: Table S2) and therefore, might influence the target 
mRNA stability in a similar fashion as described for the 
RatA/TxpA toxin-antitoxin system in B. subtilis. Duplex
formation by ratA and txpA transcripts leads to txpA 
mRNA degradation, and thus prevents toxin synthesis 
[56,57]. The largest group of cis-encoded asRNAs in
S. meliloti (n=2,266) comprises transcripts antisense to the 
protein-coding domain of their target mRNA (Additional 
file 2: Table S2). Several regulatory mechanisms have been 
postulated for this type of asRNA. An example of targeted 
co-degradation of an mRNA/asRNA duplex is illustrated 
by isiA/IsrR of Synechocystis sp. PCC6803 [58,59]. Other
regulatory mechanisms do not focus on the asRNAs 
themselves, but on the strength of their respective pro- 
moters, RNAP elongating rates, and RNAP competition 
events. Three mechanisms of transcriptional interference, 
promoter collision, promoter occlusion, and "sitting duck" 
interference (the dislodgement of RNAP slowly initiating 
transcription from one promoter by an elongating RNAP 
from another promoter), were proposed to explain the 
interfering effects of RNAP complexes acting in opposite 
directions [60]. 

sTSS (TSS of trans-encoded sRNAs) 

Trans-encoded sRNAs are usually located in IGRs, and 
in contrast to antisense transcripts, do not overlap their 
target genes. In total, 605 sTSS were associated with 
trans-encoded sRNAs (Figure 1b; Additional file 2: Table
S2). As described for SmelA075, SmelA060, and 
SmelA072 in S. meliloti, and GadY in E. coli, processed 
variants of non-coding transcripts are not unusual in 
bacteria [17,61]. Thus, we assigned closely neighboring 
TSS to a single sRNA region. An example for newly 
identified SMa_sRNA_136 is shown in Figure 1e. The
majority of trans-encoded sRNAs (n=362) were assigned
only a single TSS, while 78 exhibited at least one alternative
TSS (Figure 1c; Additional file 2: Table S2). Therefore,
our RNAseq approach identified 440 sRNAs
located in 302 IGRs (Figure 1c; Additional file 2: Table
S2). Newly identified sRNAs were named following the
scheme SMx_sRNA_x and added to the S. meliloti 
GenDB database [39]. 

Comparison of our RNAseq approach to that of Schluter 
et al. [17], confirmed 259 mTSS, 29 sTSS, 20 asTSS, and 43 
seTSS (Figure 2c; Additional file 2: Table S2). For further 
validation, we compiled an additional set of transcripts 
from the previous 454 RNAseq data [17], but applied less
stringent criteria (Methods). Comparison of the additional 
set of transcripts to those of this study showed concordance 
for 517 (Additional file 2: Table S2). For transcripts identi- 
fied in both data sets, locations of the 3' ends were retrieved 
from the previous 454 RNAseq data and are provided in 
Additional file 2: Table S2. Examples of four such tran- 
scripts are shown in Figure 2c. 

Our RNAseq analysis revealed 440 sRNAs; therefore, 
the ratio of CDS to trans-encoded sRNAs in the S. meliloti
genome is 14.3 (Figure 1c; Additional file 2: Table S2).
Somewhat lower CDS/trans-encoded sRNA ratios were
observed in V. cholerae, Anabaena sp. PCC7120, and 
Synechocystis sp. PCC6803, with proportions of 8, 4.4, and 
10, respectively [5,6,48]. In S. meliloti, the majority of 
sRNA genes (n=180) are located as single sRNA loci in a 
single IGR. However, 85 IGRs harbor two or three
and 37 IGRs harbor at least four sRNA genes (Figure 1c).
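The sRNA counts quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers stated in the text (6,296 annotated protein-coding genes, 362 and 78 sRNAs, and the three IGR groups); the variable names are illustrative.

```python
# Counts quoted in the text.
annotated_cds = 6296
srnas_single_tss = 362
srnas_multi_tss = 78
igrs_by_srna_count = {"1": 180, "2-3": 85, ">=4": 37}

total_srnas = srnas_single_tss + srnas_multi_tss   # sRNAs identified
cds_per_srna = annotated_cds / total_srnas          # CDS / sRNA ratio
total_igrs = sum(igrs_by_srna_count.values())       # IGRs carrying sRNAs

print(total_srnas, round(cds_per_srna, 1), total_igrs)
# prints: 440 14.3 302
```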

Although pSymA represents only 20.2% of the total genome
size in bp, 40.9% of the trans-encoded sRNAs identified
in this study mapped to pSymA, indicating an over-representation
of sRNA genes on pSymA and an under-representation
on the chromosome (40.7%) and pSymB
(18.4%) (Figure 1b; Additional file 2: Table S2). Considering
that many genes required for symbiosis are located on 
pSymA [21], it is tempting to speculate that trans-encoded 
sRNAs play an important role in post-transcriptional regu- 
lation of the symbiotic genes. 

Aside from the previously identified 6S RNA 
(SMc_sRNA_307), the signal recognition particle 
(SMc_asRNA_791), and the tmRNA (SMc_sRNA_283) 
[17], assigning functions to sRNAs remains a complex 
challenge. This study extends our knowledge of sRNAs 
and 5'-UTRs and consequently will facilitate prospective 
functional analyses of sRNA/5'-UTR interactions. 

Comparison to published TSS data 

We compared 124 S. meliloti 1021 TSS, published over 
the past several decades, to our RNAseq TSS (Additional 
file 2: Table S5). This set of previously published TSS is 
biased toward rpoH-dependent genes because 69 of the
124 TSS were reported in a single study [62]. Of the 124 
published TSS, 60% matched an RNAseq TSS exactly, 
11% differed by only 1-4 nt, and 7% differed by >5 nt. 
The remaining 27 previously reported TSS were not 
identified in our RNAseq data. Some of these may be 
explained by the different growth conditions tested in 
each study. For example, fixA and nifH are expressed
primarily in symbiosis, a condition that was not included 
in our study for technical reasons. Additionally, Barnett 
et al. [62] used different growth media than we did, 
which may explain why we failed to identify 11 rpoH- 
dependent TSS. Two genes, nodD3 and syrA, previously 
reported to have low expression and extremely long 5'- 



UTR [63,64], were also absent in our RNAseq data. We 
identified a TSS for ndvA in our RNAseq data that was 
misclassified by our automated system as belonging to 
the 5'-UTR of an asRNA (Additional file 2: Table S5). 
Other reported TSS, such as those corresponding to 
hypM, rpoN, and trkH, had matching reads in our 
RNAseq data, yet failed to meet the coverage cutoff. 
These examples illustrate the delicate balance between 
sensitivity and specificity in global studies; in our study, 
a cutoff of >30 reads was chosen to provide the best bal- 
ance between identification of true positives and back- 
ground noise. 
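The comparison with published TSS, together with the read-coverage cutoff, can be sketched as follows. This is an illustration only: the function names, the example coordinates, and the `max_dist` window beyond which a published TSS is treated as "not found" are hypothetical; the >30-read cutoff and the distance bins (exact match, 1-4 nt) come from the text.

```python
def filter_by_coverage(read_counts, min_reads=30):
    """Keep positions supported by more than min_reads reads."""
    return sorted(pos for pos, n in read_counts.items() if n > min_reads)


def compare_tss(published, mapped, max_dist=50):
    """Bin a published TSS by its distance to the nearest mapped TSS.

    max_dist is a hypothetical window, not a value from the paper.
    """
    if not mapped:
        return "not found"
    distance = min(abs(published - pos) for pos in mapped)
    if distance == 0:
        return "exact"
    if distance <= 4:
        return "1-4 nt"
    if distance <= max_dist:
        return "larger offset"
    return "not found"


# Toy example: position -> number of reads starting there.
counts = {1200: 310, 1203: 12, 1550: 45, 4000: 30}
mapped = filter_by_coverage(counts)     # 1203 and 4000 fail the cutoff
print(mapped)
print(compare_tss(1200, mapped), compare_tss(1552, mapped),
      compare_tss(4000, mapped))
```

In the toy data, the position with exactly 30 reads is dropped, which mirrors how a genuine TSS with matching reads can still fall below the cutoff.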

As previously discussed, a major strength of our study 
is the comprehensiveness of our data. Depending on the 
location of the specific primer(s) chosen, alternate TSS 
may be missed by primer extension or 5' RACE map- 
ping. Hence, in comparing our TSS data to published re- 
ports, we identified 45 genes with multiple TSS, where 
just one TSS had been reported before (Additional file 2: 
Table S5). In sum, the TSS identified in this study are in 
good agreement with previous reports. 

Promoter prediction 

The ~17,000 TSS identified with RNAseq are an excellent
resource with which to increase our knowledge about promoter
properties in S. meliloti. As described above, previous
promoter mapping has focused on regions upstream
of a limited number of protein-coding genes. In addition,
bioinformatics-based predictions have disregarded promoter
motifs not associated with, or in antisense orientation
to, protein-coding genes [27,62,65]. In this
study, however, TSS corresponding to ~3,400 asRNAs or sRNAs
allowed us to directly identify 724 promoter sequences upstream
of these genes, in addition to 1,371 promoters
identified upstream of protein-coding (mTSS and lmTSS)
genes (Figures 1c and 3a).

The σ subunit of RNAP confers promoter specificity
through interaction with conserved sequences upstream
of TSS [3]. The S. meliloti genome encodes 15 σ factors:
14 of these belong to the σ70 family, and thus are predicted
to recognize -35/-10-type promoters, while one (RpoN)
belongs to the σ54 family and recognizes -24/-12-type promoters.
Putative promoter consensuses have been identified
corresponding to seven of the 15 S. meliloti σ factors
(σE1, σE2, σE9, σH1, σH2, σ54, and σ70), and for promoters
recognized by both σH1 and σH2 (σH1/2).

Our promoter prediction procedure is illustrated in 
Additional file 1: Figure S3 and described in detail in 
Methods. Briefly, previously identified promoter sequences 
for a 70 , a* 11 , o^ 12 , o^ 1172 , a 54 , and o^ 2 were used to calculate 
position-specific scoring matrices (PSSM) [27,62,65], 
which were then used in a genome-wide search for pro- 
moter motifs. The resulting motifs were correlated with 
the locations of all mapped TSS and deemed to be 
putative promoters if located in the appropriate position 
relative to the TSS (-24/-12 for σ54, -35/-10 for all others; 
Methods). For prediction of σE1 and σE9 promoters, we 
used a different approach. These two σ factors belong to 
the ECF class of σ factors, which are often autoregulatory 
on their own operons [66]. We had previously shown that 
expression of rpoE1 and rpoE4 is dramatically increased in 
S. meliloti 1021, compared to S. meliloti 2011 [67], due to 
a loss-of-function mutation in ecfR1, encoding the putative 
RpoE1 anti-σ factor (Krol and Becker, unpublished results) 
[68]. This suggested that RpoE1 autoregulates expression 
of its own operon and that of rpoE4. Examination of re- 
gions upstream of the TSS for these two operons revealed 
identical motifs (GAAT-N18-GTCT). This motif was used 
in global string searches to identify additional putative σE1 
motifs upstream of TSS (Methods). For RpoE9, we also 
performed string searching, but with the promoter motif 
recently predicted in [69], TGTCACA-N16-CGTC, which 
is located 9 nt upstream of the TSS we identified for the 
SMb20029-rpoE9 operon. 
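
The string search for a spaced motif such as GAAT-N18-GTCT can be sketched as follows. This is a minimal illustration rather than the procedure given in Methods: the function names, the 0-based coordinates, and the 50-nt default window are assumptions made for the example, and the genome is treated as a linear string.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(genome, tss, strand, length=50):
    """Return up to `length` nt directly upstream of a 0-based TSS,
    written 5'->3' on the strand of the transcript (TSS excluded)."""
    if strand == "+":
        return genome[max(0, tss - length):tss]
    return reverse_complement(genome[tss + 1:tss + 1 + length])


def find_gapped_motif(region, left, spacer, right):
    """Find left-N(spacer)-right motifs in an upstream region.

    Returns (matched sequence, gap) pairs, where gap is the number of
    nt between the 3' end of the motif and the TSS.
    """
    # A lookahead is used so that overlapping matches are all reported.
    pattern = re.compile(f"(?=({left}[ACGT]{{{spacer}}}{right}))")
    hits = []
    for match in pattern.finditer(region):
        motif_end = match.start() + len(match.group(1))
        hits.append((match.group(1), len(region) - motif_end))
    return hits
```

Applied to every mapped TSS, a call such as `find_gapped_motif(upstream_region(genome, tss, strand), "GAAT", 18, "GTCT")` would list candidate motifs together with their spacing to the TSS, which can then be filtered for a plausible promoter position.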

Overall, we predicted 2,847 promoter sequences 
upstream of TSS (Figure 3a; Additional file 3: Tables S6-S13). 
σ70-dependent promoters make up the largest 
group of motifs (1,087, with 789 of these specific for 
σ70 only; Figure 3a; Additional file 3: Table S6). 
Similarity between the S. meliloti σ70 and σH promoter 
consensuses was previously noted [62]; thus, as 
expected, we found significant overlap (n=298) 
between the σ70 and σH promoter sets (Figures 3a and 
3b). E. coli σ70 and σ32 promoter consensuses also 
share similar features [70], demonstrating that such 
overlap is not limited to S. meliloti, and that any in 
silico promoter prediction must be validated 
experimentally. In 2006, the first genome-wide prediction of 
S. meliloti σ70 promoter motifs provided a set of 117 
experimentally validated (n=25) or computationally 
predicted (n=92) promoter motifs [65]. We confirmed 
that 55 of their 117 promoter sequences matched our 
TSS and σ70 promoter predictions. Ten promoter 
sequences identified in their study were not associated 
with mTSS in our study, but instead with TSS of other 
categories. 

RpoH-dependent promoter motifs (σH1, σH2, σH1/2) [62] 
were assigned to 1,556 unique TSS, of which 298 were also 
identified in searches with the σ70 promoter motif (Figure 3a; 
Additional file 3: Tables S7-S9). We identified extensive 
overlap among σH1, σH2, and σH1/2 promoter predictions 
(Figure 3b; Additional file 3: Tables S7-S9). Barnett et al. 
[62] characterized the RpoH1 and RpoH2 regulons during 
heat shock and stationary phase growth: in addition to the 
experimental validation of 69 TSS downstream of putative 
RpoH-dependent promoters (51 of which were confirmed 
in this study), they predicted 75 putative promoters 
upstream of protein-coding genes [62]. Our data set 
validated ~59% (n=44) of these predicted promoters and added 
647 previously unknown RpoH promoter motifs upstream 
of mRNA start sites (mTSS or lmTSS). Thirty-three of these 
promoter motifs are located upstream of rpoH-dependent 
genes, but were not previously identified by Barnett et al. 
[62]. Due to the similarity of RpoH and RpoD promoter 
motifs, the remaining set of 614 predicted RpoH promoters 
likely includes many false positives. RpoH-dependent 
regulation of the small non-coding RNAs SmelC781 and 
SmelC456 was proposed [62], and indeed, our analysis 
identified σH1, σH2, and σH1/2 binding motifs upstream of 
SmelC781 and SmelC456 (Additional file 2: Table S2). 

RpoN (σ54)-type σ factors are unique in that they require 
an activator to initiate transcription [3]. These activators, 
known as enhancer-binding proteins (EBP), interact with 
σ54 and bind DNA upstream of the RpoN-target 
promoters [3]. The S. meliloti genome is predicted to encode 
seven EBP (NifA, DctD, NtrC, NtrX, TacA, SMb20102, 
and SMb21200), but target genes have been identified for 
only the first three of these (reviewed in [25]). We 
identified RpoN-dependent promoter motifs (σ54) upstream of 
112 TSS (Additional file 3: Table S10). Due to the lack of 
symbiotic samples in our RNA pools, we did not identify 
TSS upstream of the RpoN-dependent genes nifA, nifB, 
nifH, glnII, and SMa0872. Only four of our predicted 
promoters (dctA, glnB, glnK, SMb20436) were previously 
characterized as RpoN-dependent or predicted by in silico 
methods [25]; therefore, our data provide a useful resource 
for the identification of additional RpoN-regulated genes 
and characterization of EBP. 

For σE2, 346 promoters were predicted (Additional file 3: 
Table S11), of which 47 were previously found by Sauviac 
et al. [27]. An additional 58 of these promoters were 
experimentally validated for rpoE2-dependence by 
microarray (Figure 3c; see below). Furthermore, we identified 41 
putative σE1 promoter motifs (Additional file 3: Table 
S12). In contrast, only two additional promoter motifs 
were predicted for σE9 (Additional file 3: Table S13). This 
was not unexpected, as the RpoE9-type σ factors of 
Rhodobacter sphaeroides and Bacillus licheniformis were 
proposed to activate only their own operons [69]. 

Using our large set of newly predicted promoter 
motifs, we recompiled each of the consensus sequences for 
σ70, σ54, σE2, σH1, σH2, and σH1/2. As expected, the 
resulting motifs (Additional file 1: Figure S4) were 
similar to the previously reported consensus sequences 
[27,62,65]. As described before, the σ70 -10 region is 
poorly conserved in S. meliloti [62,65] (Additional file 1: 
Figure S4). 

Global identification of promoter motifs in S. meliloti 
was hitherto hampered by limited knowledge about 
TSS of individual genes. Combining the TSS identified 
in this study with the previously identified sRNAs [17] 
provided a rich resource for identification of promoters 
and their assignment to σ factors. In summary, we were 
able to identify promoter motifs preceding ~17.5% 
(1,100 of 6,296; Additional file 3: Tables S6-S13) of 
known protein-coding genes [33] and 633 non-coding 
RNA genes (131 sRNA and 502 asRNA). We also 
identified 632 promoter motifs preceding 502 DNA regions 
corresponding to sense transcripts (distinct seTSS with 
one or more promoters). This finding supports the 
hypothesis that at least a small percentage of sense 
transcripts may represent intact transcripts, such as 
regulatory sRNAs or mRNAs that are transcribed from 
promoters internal to the predicted ORF [5]. 

Transcription factor binding sites 

While the σ subunit of RNAP is the primary means of 
conferring specific transcript initiation, it has long been 
known that various transcription factors interact with bac- 
terial promoters to activate or repress transcription [3]. 
Activator-dependent promoters are common among those 
recognized by σ70 and often possess defective promoter 
elements [3]. Promoter searches using PSSM may fail to 
detect promoters possessing atypical core elements; there- 
fore, understanding the mechanisms by which activators 
and repressors act at promoters is essential for a complete 
view of any transcriptional landscape. Unfortunately, such 
mechanisms, including specific transcription factor bind- 
ing sites, are quite diverse and difficult to detect de novo 
on a global scale. However, TSS data can be used to pro- 
vide information on promoter structure where binding 
sites are predicted or known, and to guide discovery of 
transcription factor binding sites where some functional 
information exists. 

Additional file 4: Table S14 correlates previously 
reported transcription factor binding sites [34] with our 
TSS data. The group of predicted PhoB binding sites il- 
lustrates how TSS information can provide supporting 
empirical evidence for in silico binding site predictions. 
Under phosphate (Pi) limitation, the PhoR histidine kin- 
ase phosphorylates its cognate response regulator, PhoB, 
which interacts with conserved motifs (PhoB boxes) up- 
stream of genes involved in the Pi starvation response. 
While a large regulon of genes involved in S. meliloti re- 
sponse to Pi limitation has been identified, and several 
dozen PhoB boxes have been predicted [71,72], TSS have 
been determined for only three PhoB-dependent genes 
(phoC, phoX, and pstS in Additional file 2: Table S5). In 
E. coli, the PhoB box overlaps and replaces the -35 σ70 
recognition elements of PhoB-dependent promoters 
[73]. Since the three S. meliloti PhoB-dependent genes 
with mapped TSS appeared to have poor -35 σ70 recognition 
elements overlapping with PhoB motifs, we matched 
the locations of previously predicted PhoB boxes to our 
TSS. Of the 33 genes with predicted PhoB boxes and TSS 
in this study, 16 had boxes that overlapped the -35 motif 
(i.e., were located 22 to 25 nt upstream of the TSS) and are 
therefore strong candidates for PhoB binding sites (Additional file 4: 
Table S14). Additional PhoB boxes are located farther up- 
stream of TSS; these may be targets for other types of 
regulation, for example, PhoB-mediated repression, as was 
proposed for SMc02862 [74]. 
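
The positional test applied to PhoB boxes can be illustrated with a short sketch. The text does not state from which end of the box the 22 to 25 nt were measured, so the example assumes 1-based inclusive coordinates and measures from the TSS-proximal edge of the box; the function names are hypothetical.

```python
def distance_upstream(box_start, box_end, tss, strand):
    """Distance in nt from the TSS-proximal edge of a binding box to
    the TSS (1-based inclusive coordinates). Positive values mean the
    box lies upstream of the TSS on the strand of the transcript."""
    if strand == "+":
        return tss - box_end
    return box_start - tss


def overlaps_minus35(distance, window=(22, 25)):
    """True if the box sits where it would overlap the -35 element.
    The 22-25 nt window follows the text; how the distance is measured
    is an assumption of this sketch."""
    low, high = window
    return low <= distance <= high
```

Boxes that fail this test but still lie upstream of a TSS would fall into the second group discussed above, that is, candidates for other modes of regulation.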

An example of how TSS data may guide binding site 
discovery is provided by CtrA, a global 
cell cycle regulator. CtrA is autoregulatory upon its own 
promoter, which contains five verified CtrA binding sites 
(Figure 1d) [38]. In contrast to C. crescentus, where the 
CtrA regulatory circuit is well characterized [75], little is 
known about CtrA targets in S. meliloti. We used PSSM 
to locate additional putative CtrA binding sites upstream 
of S. meliloti protein-coding genes and compared these to 
our TSS data (Methods). We identified 175 putative binding 
motifs, representing 158 genes: ~70% of the motifs are 
located upstream of TSS or overlap TSS (Additional file 4: 
Table S15). We confirmed CtrA motifs upstream of TSS 
corresponding to S. meliloti cell cycle-related genes previ- 
ously predicted to possess such motifs (minC, chpT, rcdA, 
pleC) [76,77], and identified CtrA motifs upstream of rpoD 
and ftsE, previously shown to bind CtrA in Brucella abor- 
tus [78] . Other S. meliloti genes with CtrA motifs that are 
plausibly cell cycle regulated or were shown to be direct 
CtrA targets in C. crescentus include podJ, mraZ, metK, 
clpP2, ftsK, flaA, flaC, flaD, mcpE, mcpY, and SMc00651 
(Additional file 4: Table S15). Our data also suggest poten- 
tial regulatory mechanisms; for example, S. meliloti gcrA, 
an ortholog of C. crescentus gcrA encoding a master cell 
cycle regulator that activates ctrA transcription [75], is 
transcribed divergently from argD and possesses a strong 
CtrA motif overlapping the gcrA TSS, but on the opposite 
strand. A weaker CtrA motif that did not make our cutoff 
(TAA-N7-TTACT; P-value = 3.4 × 10^-4) is located on the 
same strand as the gcrA TSS and almost completely over- 
laps the strong motif. The regulatory significance of this 
arrangement is unknown, but we speculate that CtrA may 
mediate a complex mechanism of both activation and re- 
pression at the gcrA promoter. 
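
A PSSM search of the kind used here for CtrA motifs, and earlier for σ factor promoters, can be sketched as a log-odds scan. This is a minimal illustration under stated assumptions, not the implementation described in Methods: the pseudocount, the uniform background (S. meliloti DNA is GC-rich, so a genome-derived background would be more realistic), and the score threshold are all choices made for the example.

```python
import math


def build_pssm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds position-specific scoring matrix from aligned
    binding sites of equal length (uppercase A/C/G/T only)."""
    if background is None:
        # Assumption: uniform base composition.
        background = {base: 0.25 for base in "ACGT"}
    pssm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (
                len(sites) + 4 * pseudocount
            )
            scores[base] = math.log2(freq / background[base])
        pssm.append(scores)
    return pssm


def scan(sequence, pssm, threshold):
    """Slide the matrix along one strand of `sequence` and return
    (start, window, score) for every window scoring >= threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pssm, window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits
```

Scanning the reverse complement as well, and then comparing hit positions with mapped TSS, would reproduce the general logic of correlating motifs with transcription start sites.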

de novo promoter motif discovery 

Of the 4,925 protein-coding genes for which we did not 
identify a promoter, 3,468 lacked TSS with >30 reads. The 
remaining 1,457 genes had one or more TSS (mTSS or 
lmTSS), but lacked an upstream promoter prediction. 
These may represent genes transcribed by σ70, σ54, σE1, 
σE2, σE9, σH1, or σH2, but whose promoters contain atypical 
core motifs, or genes whose promoters are recognized by 
one or more of the other eight S. meliloti σ factors (σE3, σE4, 
σE5, σE6, σE7, σE8, σE10, and σFecI). In order to determine 
whether any of these genes lacking a promoter prediction 
share similar, as yet unidentified promoter motifs, we 
extracted sequences 40 nt upstream of high coverage TSS 
(>300 reads) and used them as input for MEME ([79]; 
Methods). Using this set of 482 upstream sequences, we 
identified three motifs with E-values of < 1 (Additional 
file 1: Figure S5). Motif 1 was identified upstream of 159 
genes and consists of a conserved upstream TTG and a 
conserved downstream A residue (Additional file 1: Figure 
S5). In addition to genes lacking an annotated function 
(~40%), the list includes genes involved in housekeeping, 
transport, metabolic, and regulatory functions (Additional 
file 4: Table S16). Therefore, we speculate that Motif 1 
corresponds to genes transcribed by RpoD that possess a 
conserved TTG motif, but lack other features of the RpoD 
consensus sequence (Additional file 1: Figure S4). Motif 2 
contains a central TTTGTTAACCAT sequence and was 
identified upstream of 19 genes (Additional file 1: Figure 
S5; Additional file 4: Table S16), 84% of which were previously 
shown to have high expression in rich medium, yet 
decreased expression in nodule bacteria and during heat 
shock [62,80]. Motif 2 does not appear to match any previ- 
ously identified S. meliloti promoter motifs or any motifs 
in the manually curated prokaryotic PSSM database, 
RegTransBase [81]. Motif 3 (TTCA-N7-TTCA) was iden- 
tified upstream of 12 genes (Additional file 1: Figure S5; 
Additional file 4: Table S16), five of which were previously 
identified as FeuP-dependent [82]. FeuP is a response 
regulator required for bacterial invasion of root 
nodules and activation of at least 14 genes [82], including 
ndvA encoding a cyclic β-glucan exporter, and feuN, a 
negative modulator of the FeuPQ pathway that is pre- 
dicted to be cotranscribed with feuPQ (Additional file 4: 
Table S17) [83]. We checked whether Motif 3 was present 
upstream of the remaining nine FeuP-dependent genes: 
three (including feuN) had motifs at an appropriate distance 
upstream of TSS and three had motifs upstream of ORFs 
for which TSS were not identified (Additional file 4: Table 
S17). Therefore, we suggest that Motif 3 is a candidate 
FeuP binding site. In sum, our identification of three novel 
motifs upstream of TSS demonstrates the utility of global 
TSS data as a discovery tool. 
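
Preparing the input for de novo motif discovery amounts to collecting the 40 nt upstream of each high-coverage TSS in FASTA format. The sketch below shows one way this could be done; the record layout (name, 0-based position, strand, read count), the function names, and the decision to skip TSS too close to the end of a linear sequence are assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_fasta(genome, tss_records, length=40, min_reads=300):
    """Return FASTA text with the `length` nt upstream of every TSS
    supported by more than `min_reads` reads.

    tss_records: iterable of (name, position, strand, reads) tuples,
    with 0-based positions on a linear genome string.
    """
    entries = []
    for name, pos, strand, reads in tss_records:
        if reads <= min_reads:
            continue  # keep only high-coverage TSS
        if strand == "+":
            start, end = pos - length, pos
        else:
            start, end = pos + 1, pos + 1 + length
        if start < 0 or end > len(genome):
            continue  # window runs off the sequence; skipped here
        seq = genome[start:end]
        if strand == "-":
            seq = reverse_complement(seq)
        entries.append(f">{name}\n{seq}\n")
    return "".join(entries)
```

The resulting text can be written to a file and passed to a motif discovery tool such as MEME.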

The S. meliloti RpoE2 (σE2) regulon 

The S. meliloti ECF σ factor, RpoE2 (σE2), responds to 
heat shock, osmotic stress, oxidative stress, and carbon/ 
nitrogen starvation during stationary phase growth 
[27,84], activating expression of at least 45 genes [27]. 
Most α-proteobacteria possess RpoE2-like σ factors and 
their role was proposed to be analogous to that of RpoS in 
E. coli [85]. The activity of σE2 is controlled by a partner 
switching mechanism, which includes negative regulation 
via two paralogous anti-σ factors, RsiA1 (SMc01505) and 
RsiA2 (SMc04884), and positive regulation via two 
anti-anti-σ factors, RsiB1 (SMc01504) and RsiB2 (SMc00794) 
[27,86]. RsiB1 and RsiB2 each contain a C-terminal 
receiver domain, which is phosphorylated by an unidentified 
histidine kinase(s), allowing the N-terminal ECF σ-like 
domain to interact with RsiA1 or RsiA2, thereby relieving 
inhibition of σE2 [86]. 

A previous study identified rpoE2-dependent genes in 
S. meliloti 1021 by comparing transcription profiles of an 
rpoE2 mutant and wild type during heat shock [27]. To 
identify additional RpoE2 target genes, independent of any 
specific stress condition, and to verify RpoE2-dependent 
regulation of genes associated with putative σE2 TSS 
identified in this study, we used custom Affymetrix GeneChips 
to obtain transcription profiles of S. meliloti strains either 
overexpressing rpoE2 from an inducible promoter or 
carrying the empty vector (Methods). As mentioned above, 
S. meliloti 1021 has a defective ecfR1 gene, resulting in 
high constitutive RpoE1 (σE1) activity. Sigma factors 
compete with each other for RNAP [3]; therefore, to 
maximize the amount of RNAP-σE2, and presumably 
enhance identification of σE2 targets, we performed our 
experiments in a strain containing a functional ecfR1 gene 
(CL150; Methods). 

Our analyses revealed that 202 protein-coding genes 
displayed more than twofold increased transcript 
abundance in the strain overexpressing rpoE2, whereas three 
mRNAs showed lower abundance (Additional file 5: Table 
S18). The majority of the protein-coding genes (n=150) 
lack a predicted function (Additional file 5: Table S18). 
About 95% of rpoE2-dependent transcripts previously 
identified by microarray [27] showed increased expression 
in the rpoE2-overexpressing strain. In addition, we 
identified 161 new rpoE2-dependent protein-coding genes (fold 
change > 2), 41 of which confirm previous in silico 
predictions [27]. Overall, ~72% of rpoE2-dependent 
protein-coding genes had one or more mapped TSS in our 
RNAseq data (Additional file 5: Tables S18 and S20). 
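
The twofold criterion can be made concrete with a small sketch. It assumes that expression changes are available as a signal log ratio (SLR), a log2-scale value such as the "avg SLR" reported in Figure 4; the function names are hypothetical, and the statistical filtering described in Methods is not reproduced here.

```python
import math


def fold_change(slr):
    """Convert a signal log ratio (log2 scale) to a linear fold change."""
    return 2.0 ** slr


def classify(slr, min_fold=2.0):
    """Label a gene by whether its abundance changed more than
    `min_fold` in either direction (the cutoff itself is excluded)."""
    cutoff = math.log2(min_fold)
    if slr > cutoff:
        return "increased"
    if slr < -cutoff:
        return "decreased"
    return "unchanged"
```

Under this convention an SLR of 1 corresponds to exactly a twofold change and is not counted, since the text requires more than twofold.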

As expected, expression of regulatory genes rpoH2 
(SMc03873), rpoE5 (SMb21484), rsiA1 (SMc01505), and 
rsiB1 (SMc01504) was rpoE2-dependent; moreover, each of 
these genes has an mTSS downstream from an 
RpoE2-dependent promoter (Figure 4a; Additional file 5: Table 
S19). In addition, we identified TSS with σE2 motifs 
upstream of two operons encoding rpoE2-dependent response 
regulator-histidine kinase pairs: SMa0113/114 and exsFG 
(SMb20933/934; Additional file 5: Table S19). These 
observations demonstrate the complexity of the RpoE2 
regulatory circuit and suggest that rpoE2 overexpression may 
result in multiple, secondary regulatory effects. Thus, 76 
rpoE2-dependent protein-coding genes possess a TSS 
preceded by a σE2-binding motif: these are likely directly 
controlled by σE2, while transcription of the remaining genes 
likely depends on other σ factors (Figures 4a and 4c, 
Additional file 5: Table S19). For example, 22 
rpoE2-dependent operons were previously shown to be 
rpoH-dependent [62]. Fifteen (68%) of these had an RNAseq TSS 
preceded by a promoter prediction for RpoE2 (n=5), RpoH 
(n=9), or both (n=1) (Additional file 5: Tables S18 and S19). 

In addition to detecting mRNAs corresponding to 
protein-coding genes, our custom Affymetrix chip detects 
transcriptional activity for IGRs > 150 nt. The strain 
overexpressing rpoE2 showed increased hybridization, 
compared to the control strain, to 63 probe sets, 
corresponding to 56 IGRs, and decreased hybridization to two 
probe sets (Figure 3c; Additional file 5: Tables S18 and 
S20). Our RNAseq approach detected transcripts 
corresponding to 43 of these IGRs. About 60% of IGRs showing 
rpoE2-dependent hybridization are adjacent to 
rpoE2-dependent protein-coding genes, and therefore likely 
represent 5'- or 3'-UTR of those genes (Figure 3c; Additional 
file 5: Tables S18 and S20). IGRs corresponding to 5'-UTRs 
of six genes (SMa0022, SMa2357, SMb20413, SMb20705, 
SMc00794, and SMc00922) showed increased expression 
in the rpoE2 overexpressing strain that did not correlate 
with that of their corresponding coding regions (Figure 3c; 
Additional file 5: Table S18); however, we confirmed rpoE2 
dependence for three of these by 5' RACE mapping 
(SMa2357, SMb20705, SMc00922; see below). Five 
protein-coding genes, newly identified in this study, corre- 
lated with rpoE2-dependent IGRs identified by Affymetrix 
analysis (Figure 4c; Additional file 5: Table S20). 

σE2 promoter motifs were predicted upstream of 23 
sRNAs (Figure 3a, Additional file 5: Table S20). Our 
microarray data provide evidence for rpoE2-dependent 
transcription in eight IGRs where sRNA genes are located: 
SmelA018, SmelA060, SMa_sRNA_126, SMb_sRNA_381, 
SMb_sRNA_432 (SmelB126), SMb_sRNA_433 
(SmelB127ov), SMc_sRNA_259, and SMc_sRNA_298 (Figures 3c 
and 4c; Additional file 5: Table S20). This suggests that 
sRNA-mediated post-transcriptional control is part of the 
RpoE2 regulatory circuit. 

σE2 promoter motifs were identified upstream of 40 
asRNAs (Figure 3a; Additional file 3: Table S11). At the 
post-transcriptional level, cis-encoded asRNAs and 
trans-encoded sRNAs may play a role in complex 
regulatory circuits by allowing for interference between 
transcripts of different σ factor regulons. For example, eight 
rpoE2-dependent asRNAs are located in antisense to 
genes predicted to be controlled by σ70, σH, or both 
(Figure 5a). Figure 5b shows an example for the 
σE2-dependent SMc_asRNA_904, which may interact with 
the σ70-dependent SMc02217 mRNA. In addition, 70 
asRNAs were found in antisense to rpoE2-dependent 
genes. We predicted binding motifs for σ70, σH, or both, 
upstream of 17 of these asRNAs. An example is the 
σ70-dependent SMc_asRNA_1559, which may interact with 
the rpoE2-dependent SMc04232 mRNA (Figures 5a and 
5b). These observations strongly support the hypothesis 
that asRNAs belonging to a distinct σ factor regulon can 
interact with targets of different σ factor regulons. 

5' RACE mapping of rpoE2-dependent transcripts 

To confirm RNAseq and Affymetrix data, we chose 26 
genes for 5' RACE (rapid amplification of cDNA ends) 
mapping (Methods). For these experiments, we used the 
same RNA samples as were used for the Affymetrix ana- 
lyses. Because IGR probe sets on our custom Affymetrix 
GeneChip are a low-resolution means of delineating 
transcript structure, we examined TSS and rpoE2 de- 
pendence for putative sRNAs, newly identified ORFs, 
and 5'-UTR detected by these IGR probe sets. Also, 
since RpoE2 was shown to activate expression of rpoH2 
[27], we tested seven rpoE2-dependent genes whose 
expression was previously shown to be rpoH-dependent 
[62]. As controls, we included two regulatory genes whose 
expression was not rpoE2-dependent, rpoH1 and rsiA2. 
Results of 5' RACE mapping are provided in Additional file 5: 
Table S21, and gel photos showing results for four repre- 
sentative transcripts are shown in Additional file 1: Figure 
S6. Of the 26 RACE mapped genes, 24 had TSS in the 
RNAseq data, and all but three of these TSS matched our 
RACE determined start sites exactly. rpoE2 dependence 
was confirmed qualitatively by 5' RACE for all genes tested 
except for SMb20413, and as expected, rpoH1 and rsiA2. 
Most mapped promoters had upstream matches to the 
RpoE2 promoter consensus sequence; thus, the corresponding 
genes are likely direct targets of RpoE2. Two genes previously 
identified as rpoH-dependent (SMc01507 and 
SMc03968; [62]), had RpoH-like promoters. 
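
The comparison between RACE-determined and RNAseq-derived start sites reduces to a per-gene position check. The sketch below illustrates the idea under simplifying assumptions: one TSS per gene in each data set, positions on the same coordinate system, and hypothetical function and variable names.

```python
def compare_tss(race_tss, rnaseq_tss):
    """Compare TSS positions from 5' RACE with those from RNAseq.

    Both arguments map gene name -> TSS position. Returns three items:
    genes with identical positions, a dict of gene -> offset
    (RNAseq minus RACE) for mismatches, and genes without an RNAseq TSS.
    """
    exact, shifted, missing = [], {}, []
    for gene, race_pos in race_tss.items():
        if gene not in rnaseq_tss:
            missing.append(gene)
        elif rnaseq_tss[gene] == race_pos:
            exact.append(gene)
        else:
            shifted[gene] = rnaseq_tss[gene] - race_pos
    return exact, shifted, missing
```

Reporting the offsets rather than a simple mismatch flag makes it easy to see whether discrepancies are single-nucleotide differences or distinct start sites.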

Conclusions 

To our knowledge, this study reports the first genome-wide 
RNAseq identification of TSS in an α-proteobacterium. Our 
approach defined a set of 17,001 TSS that provides mRNA 
start sites for ~45% of S. meliloti protein-coding genes and 
identifies new ORFs and ncRNAs, including asRNAs 
complementary to about one third of protein-coding genes. Over 
2,000 TSS are preceded by at least one predicted promoter 
sequence, representing potential target promoters for nearly 
half of the known S. meliloti σ factors. As promoter 
consensus sequences become available for the remaining σ factors, 
our data set can be used to identify target genes of these σ 
factors. Also, we demonstrated how TSS identification on a 
global scale aids in discovery of novel promoter motifs and 
transcription factor binding sites. Combined with our 
Affymetrix GeneChip analyses, the RNAseq data expand the 
known regulon of the σ factor, RpoE2, and identify 
candidate ncRNAs that may be involved in the S. meliloti general 
stress response. In sum, our work provides a rich resource 
for continued study of S. meliloti gene expression and 
regulatory circuits, as well as a foundation for comparative 
studies of transcription in other α-proteobacterial species. 

Methods 

Strains and growth conditions for RNAseq experiments 

The sequenced reference strain, S. meliloti strain 1021 
[20,87], and S. meliloti strain 1021expR+ were used for the 
RNAseq experiments. S. meliloti strain 1021expR+ was 
constructed by replacing expR, disrupted in 1021 by an 
insertion element [88], with a functional expR gene. 
To repair expR, the corresponding genomic sequence 
of strain Rm8530 [89] was amplified using primers 
5'-ACACAAGCTTCTTCTGAACGGCGTATTCACA and 
5'-TGATGAATTCCATTCCGTCGGCGAGATAGT and 
cloned into the HindIII and EcoRI restriction sites of 



Schlüter et al. BMC Genomics 2013, 14:156 
http://www.biomedcentral.com/1471-2164/14/156 






[Figure 4 image: panels (a) and (c) are tables with the columns gene id (or gene flanking IGR), avg. SLR, -35, spacer, -10, and promoter position; panel (b) plots read coverage upstream of SMc02894 and SMb20522. Panel contents are not recoverable from the extracted text.] 

Figure 4 Promoter predictions for rpoE2-dependent genes and intergenic regions (IGRs). (a) Protein-coding genes whose transcripts 
showed increased abundance when rpoE2 was overexpressed, and for which an mTSS and σE2 promoter motif was assigned. Genes in bold were 
predicted to possess σ70 or σH promoters, in addition to the σE2 promoter. (b) Upstream regions of SMc02894 and SMb20522 showing 
transcription start sites (indicated by blue triangles) and putative promoters. The height of each TSS symbol corresponds to the log10 coverage 
for mapped reads (y-axis). (c) IGRs whose transcripts showed increased abundance when rpoE2 was overexpressed. 



pK18mobsac [90]. The construct was introduced into the 
genome of Rm1021 by conjugation. Transconjugants were 
subjected to sucrose selection on LB (Luria-Bertani) agar 
plates [91] with 10% sucrose, which allowed for selection 
of the 1021expR+ strain. 

For large-scale identification of TSS in the genome of 
S. meliloti 1021, RNA samples representing 16 different 
growth and stress conditions, each in three biological 
replicates, were purified and pooled. Unless stated otherwise, 
50 ml cultures of S. meliloti strains 1021 and 1021expR+ 
were inoculated to a starting OD600 = 0.02 in 250 ml flasks 
and incubated at 30°C with shaking (180 rpm) in TY [92], 
VMM [93] or MOPS-MM medium [94], supplemented 
with 8 µg/ml nalidixic acid. 

Exponential and stationary phase RNA samples were 
purified from S. meliloti 1021 cultures grown in TY 
(OD600 = 0.8 or 3), VMM (OD600 = 0.6 or 1.3), and 
MOPS-MM (OD600 = 0.5). Exponential phase cultures of 
S. meliloti 1021expR+ were grown in MOPS-MM 
(OD600 = 0.5). For exposure to various environmental 
stresses, the following modifications were made. All 
stress treatments were applied to strain 1021 only, 
unless noted otherwise: 

Temperature shifts for heat (40°C) and cold shock 
(20°C) were applied for 30 min to cultures grown in TY. 

Exposure to microoxia was performed by flushing 
TY-grown cultures with pure nitrogen (N2) for 1 minute, 
followed by incubation of cultures under an N2 atmosphere 
for 45 minutes. 

Salt stress was carried out by adding sodium chloride 
(NaCl) to exponential phase, VMM-grown cultures at a final 
concentration of 0.4 M for 20 minutes. Oxidative stress 
exposure was performed similarly, except instead of NaCl, 
H2O2 was added to a final concentration of 10 mM. 

Exposure to acid and alkaline stress was as follows: 
cultures were grown to an OD600 of 0.6, centrifuged, and 
the resulting cell pellets resuspended in VMM modified 
by addition of either HCl to pH 5.8 or NaOH to pH 8.5. 

Cultures were starved for nitrogen or carbon by growing 
them to an OD600 of 0.6 in replete MOPS-MM, 

[Figure 5 image: panel (a) tabulates TSS and asRNA identifiers, genomic positions and strand, the -35, spacer, and -10 sequences, the predicted σ factor of each asRNA, and the target gene with its σ factor; panel (b) plots read coverage around SMc_asRNA_904 and SMc_asRNA_1559. Panel contents are not recoverable from the extracted text.] 

Figure 5 Identification of potential interactions between antisense RNAs (asRNAs) and transcripts of different σ factor regulons. 
(a) σ70-, σH-, and σ70/H-dependent asRNAs located in antisense to σE2-dependent target genes (top panel) and σE2-dependent asRNAs located in 
antisense to σ70-, σH-, and σ70/H-dependent target genes (bottom panel). (b) Genomic location of the asRNAs SMc_asRNA_904 and 
SMc_asRNA_1559 showing transcription start sites (indicated by blue triangles) and putative promoters. The height of each TSS symbol 
corresponds to the log10 coverage for mapped reads (y-axis). 



followed by centrifugation and resuspension of the cell 
pellets in MOPS-MM lacking either a nitrogen or 
carbon source [95]. S. meliloti 1021 and 1021expR+ cultures 
were subjected to phosphate limitation in MOPS-MM 
with 0.1 mM phosphate as described previously [71]. 

For induction with the flavonoid luteolin, cultures were 
grown in TY to an OD600 of 0.8, luteolin was added to a 
final concentration of 10 µM, and cells were harvested after 
4 hours of incubation. 

RNA preparation for RNAseq 

Total RNA was isolated using miRNeasy Mini kits and a
QIAcube automated nucleic acid preparation platform,
according to the manufacturer's instructions (Qiagen). RNA
samples were further purified by phenol/chloroform/
isoamyl alcohol extraction and ethanol precipitation.
The integrity of all RNA samples was confirmed using an
RNA 6000 Nano Kit with a Bioanalyzer 2100 (Agilent
Technologies).

cDNA library preparation and RNAseq 

The pooled RNA sample was sequentially treated with two 
enzymes: (i) terminator 5'-phosphate-dependent exonucle- 
ase (TPE; Epicentre), which specifically degrades RNA spe- 
cies with 5' monophosphates (processed transcripts), and 
(ii) tobacco acid pyrophosphatase (TAP; Epicentre), which 
prepares the TPE-resistant transcripts from (i) for adapter 
ligation by removing pyrophosphates from their 5' ends. 
Subsequently, an adapter comprised of sequences comple- 
mentary to the sequencing and amplification primers was 
ligated to the 5' phosphate of the pretreated mRNA. First- 
strand cDNA synthesis was then performed with an N6 
randomized adapter primer and M-MLV-RNase H reverse 
transcriptase (New England Biolabs). The resulting cDNA 
was PCR-amplified in 24 cycles. The primers used for PCR 
amplification were designed for amplicon sequencing 
according to Illumina/Solexa guidelines. The following 
adapter sequences flanked the cDNA inserts: 5' end,
5'-AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGACGATCTCCA-3' and 3' end,
5'-CAAGCAGAAGACGGCATACGATCAGGCAGAGGACGAGAA-3' (Illumina adapter sequences are underlined). Prior to
sequencing, double stranded cDNA was enriched for frag- 
ments of 300 to 500 nt (the optimal range for our sequen- 
cing method) by size fractionation. RNAseq was performed 
on the Illumina HiSeq 2000 sequencing system by GATC 
Biotech (Konstanz, Germany). 

We note that while our procedure resulted in an en- 
richment of primary 5' ends, such enrichment is never 
complete. Hence, a detectable proportion of a given abun- 
dant transcript with a processed 5' end can survive the en- 
richment procedure. In addition, the specific tertiary 
structure of some processed RNAs may result in their 5' 
ends being stably protected against TPE degradation. 
Therefore, for some genes, both primary and processed 
transcripts may be represented in our data set. 

Computational methods 

Identification of transcription start sites (TSS) 

Sequences of the S. meliloti 1021 genome, accession 
numbers NC_003037 (pSymA), NC_003047 (chromo- 
some), NC_003078 (pSymB), and gene annotations, in- 
cluding the set of sRNAs previously identified by 454 
sequencing [17], were downloaded from NCBI and the 
RhizoGate Portal [33], respectively. 

Sequencing reads from the cDNA library were processed
and mapped to the S. meliloti 1021 genome using segemehl
[96]. From an initial set of 37,792,608 reads, 35,043,949
reads > 18 nt were mapped to the reference sequence. To
enrich for reads originating from a TSS (the
5' end of a transcript), only reads that mapped to a unique
position in the genome, and without mismatches in the first
three bases of the 5' end, were retained for downstream
analysis. This resulted in a final set of 11,230,166 reads. The
overall workflow is outlined in Additional file 1: Figure S1.
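
The three retention criteria described above can be written as a single predicate. The following Python sketch is illustrative only: the `Alignment` record and its field names are hypothetical and do not correspond to segemehl output fields, and "reads > 18 nt" is interpreted as a minimum length of 19 nt.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one mapped read (not a segemehl data structure)."""
    read_length: int                   # read length in nt
    n_genomic_hits: int                # number of genomic positions the read maps to
    mismatch_offsets: Tuple[int, ...]  # 0-based mismatch offsets from the read's 5' end


def keep_for_tss_calling(aln: Alignment, min_length: int = 19, clean_prefix: int = 3) -> bool:
    """Apply the criteria from the text: longer than 18 nt, uniquely mapped,
    and no mismatch within the first three bases of the 5' end."""
    long_enough = aln.read_length >= min_length
    unique = aln.n_genomic_hits == 1
    clean_5prime = all(off >= clean_prefix for off in aln.mismatch_offsets)
    return long_enough and unique and clean_5prime
```

A read with a mismatch at offset 2 is rejected, whereas the same read with a mismatch at offset 10 is retained.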

The next step in the workflow was to identify TSS. We 
assembled reads into TSS contigs, assemblies of at least 
30 mapped reads sharing a common 5' end. Thus, the 5' 
end of a TSS contig constitutes a TSS, while the 3' end of 
the longest read determines the end position of the TSS 
contig. If the distance of two TSS positions differed by less 
than 3 nt, they were treated as a single TSS and merged. 
Altogether, we identified 17,001 TSS contigs: 15,056 that 
originated from a single TSS contig, and 1,945 that were 
assembled from two or more TSS contigs. TSS were 
named using the scheme: SMx_TSS_nnnn where x = A, B 
or C and denotes a location on pSymA, pSymB or the 
chromosome, respectively. 
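
As a minimal sketch of the TSS calling step, the Python function below counts reads per 5' end, keeps positions supported by at least 30 reads, and merges positions that lie less than 3 nt apart. Two details are assumptions not stated in the text: merging is chained (each candidate is compared with the previous member of the group), and a merged TSS is reported at the position with the most reads. The function name `call_tss` is hypothetical.

```python
from collections import Counter

MIN_READS = 30   # minimum number of reads sharing a 5' end (from the text)
MERGE_DIST = 3   # TSS positions closer than this are merged (from the text)


def call_tss(read_starts, min_reads=MIN_READS, merge_dist=MERGE_DIST):
    """Return [(position, read_count), ...] for one strand of one replicon.

    read_starts: iterable of 5'-end positions of the retained reads.
    """
    counts = Counter(read_starts)
    candidates = sorted(p for p, n in counts.items() if n >= min_reads)

    groups = []
    for pos in candidates:
        # Assumption: chained merging against the previous member of the group.
        if groups and pos - groups[-1][-1] < merge_dist:
            groups[-1].append(pos)
        else:
            groups.append([pos])

    result = []
    for group in groups:
        # Assumption: report the merged TSS at the best-supported position.
        best = max(group, key=lambda p: counts[p])
        result.append((best, sum(counts[p] for p in group)))
    return result
```

In this sketch, positions 100 (30 reads) and 102 (40 reads) collapse into one TSS at 102 with 70 reads, while a position supported by only 10 reads is ignored.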

Based on a minimal transcriptional unit (MTU) model, 
we derived six classes of TSS: lmTSS, mTSS, pmTSS, 
seTSS, asTSS, and sTSS (see below). These classes reflect 
the relative position and orientation of a TSS contig to its 
genomic context. The TSS classes are organized hierarch- 
ically to guarantee that TSS contigs that meet definitions 
of multiple classes are assigned to a single class. 

The MTU of a gene is divided into three consecutive regions:
5'-untranslated region (UTR), coding sequence
(CDS), and 3'-UTR. Nucleotide numbering starts with 1 
at the first nucleotide of the 5'-UTR. The size of the 
coding segment is delimited by the position of its start 
and stop codon. In contrast, we used fixed length 
thresholds for the UTRs, which represent the minimal 
required sizes of ribosome binding sites (RBS; 14 nt) and 
terminator hairpins (20 nt) within the 5'-UTR and the 
3'-UTR, respectively. The promoter region upstream of 
the MTU comprises nucleotides from position -40 to -1. 
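
The MTU geometry can be made concrete with a short Python sketch. It derives the genomic extent of an MTU and of its 40-nt promoter region from the CDS coordinates, using the fixed UTR lengths given above. The `MTU` record and `build_mtu` function are hypothetical names, and 1-based inclusive genomic coordinates are assumed.

```python
from dataclasses import dataclass
from typing import Tuple

UTR5_LEN = 14      # minimal 5'-UTR (ribosome binding site), from the text
UTR3_LEN = 20      # minimal 3'-UTR (terminator hairpin), from the text
PROMOTER_LEN = 40  # promoter region, positions -40 to -1, from the text


@dataclass(frozen=True)
class MTU:
    left: int                  # leftmost genomic coordinate (1-based, inclusive)
    right: int                 # rightmost genomic coordinate (inclusive)
    strand: str                # "+" or "-"
    promoter: Tuple[int, int]  # (left, right) genomic coordinates of the promoter region


def build_mtu(cds_left: int, cds_right: int, strand: str) -> MTU:
    """Extend a CDS by the fixed UTR lengths and attach the upstream promoter region."""
    if strand == "+":
        left = cds_left - UTR5_LEN
        right = cds_right + UTR3_LEN
        promoter = (left - PROMOTER_LEN, left - 1)
    else:
        # On the minus strand the 5'-UTR lies to the right of the CDS.
        left = cds_left - UTR3_LEN
        right = cds_right + UTR5_LEN
        promoter = (right + 1, right + PROMOTER_LEN)
    return MTU(left, right, strand, promoter)
```

For a plus-strand CDS at 1000 to 1299, this gives an MTU of 986 to 1319 with the promoter region at 946 to 985.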

TSS class definitions 
mTSS 

These TSS are probably associated with mRNAs. To asso- 
ciate identified TSS with protein-coding genes, the defined 
minimal length of 14 nt of the 5'-UTR, the defined minimal 
length of 40 additional nt for the promoter region, and the 
length range of the sequence contigs assembled from our 
data were taken into account. Contigs ranged from 87 to 
108 nt. If a contig was in the same orientation and over- 
lapped the minimal region of 14 plus 40 nt upstream of the 
start codon, we assumed that it was more likely associated with
an mRNA than to a trans-encoded sRNA and therefore 
classified its prominent 5' end as an mTSS. 

Also, to identify additional mTSS, we included cis-encoded
mRNA leader sequences from the 454 RNAseq
data [17] in our analyses (see below). Thus, a TSS contig
that is located upstream of a promoter region without
overlapping it is categorized as an mTSS if there
exists a cis-encoded mRNA leader sequence for this
MTU that starts at the same position as, or upstream of,
the mTSS.

lmTSS

These TSS are associated with leaderless transcripts and 
match the first nucleotide of the corresponding AUG 
start codon. 

pmTSS 

These represent a class of putative mTSS where it is un- 
certain if the TSS belongs to an mRNA or to the tran- 
script of an sRNA. This classification is used because 
TSS located over 300 nt away from the corresponding 
start codon are rarely observed. TSS contigs of pmTSS 
do not overlap with any MTU in sense and antisense 
and are located at most 300 nt upstream in same orien- 
tation of a start codon. 

seTSS 

Sense TSS must meet two criteria: the TSS is located
within an MTU, and has the same orientation as the associated
MTU. This is true for all TSS within an MTU except
for those that meet the definition of the lmTSS class.

asTSS 

These TSS represent cis-encoded antisense RNAs. In
contrast to seTSS, the asTSS contigs overlap MTUs in
antisense orientation.

sTSS 

These TSS represent trans-encoded sRNAs located in 
IGRs. Therefore sTSS contigs do not overlap any MTU. 
In the case where an MTU is located downstream, in 
the same orientation, a TSS is classified as an sTSS only 
if the distance between start codon and TSS exceeds the 
pmTSS threshold of 300 nt. 
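
The class definitions can be summarized as an ordered decision procedure. The Python sketch below classifies one TSS contig against a single gene; it is a simplification under stated assumptions. The order in which the rules are tested is assumed (the text states only that the classes are hierarchical), the additional mTSS route via 454 leader sequences is not modeled, and a genome-wide classifier would need to consider every neighboring gene. The function name `classify_tss` is hypothetical.

```python
UTR5, UTR3, PROMOTER, PM_MAX = 14, 20, 40, 300  # nt, thresholds from the text


def classify_tss(tss, tss_strand, contig_len, gene):
    """Classify one TSS against one gene given as (cds_left, cds_right, strand).

    Coordinates are 1-based genomic positions; tss is the 5' end of the contig.
    """
    cds_left, cds_right, gene_strand = gene
    cds_len = cds_right - cds_left + 1
    # Position relative to the first nucleotide of the start codon (= 0);
    # negative values lie upstream on the gene's strand.
    rel = tss - cds_left if gene_strand == "+" else cds_right - tss
    mtu_lo, mtu_hi = -UTR5, cds_len - 1 + UTR3
    sense = tss_strand == gene_strand
    # Contig interval in gene-relative coordinates.
    lo, hi = (rel, rel + contig_len - 1) if sense else (rel - contig_len + 1, rel)
    overlaps_mtu = lo <= mtu_hi and hi >= mtu_lo

    if sense and rel == 0:
        return "lmTSS"   # TSS coincides with the start codon
    if sense and rel < 0 and hi >= -(UTR5 + PROMOTER):
        return "mTSS"    # contig reaches the 14 + 40 nt region upstream of the start codon
    if sense and rel < 0 and not overlaps_mtu and rel >= -PM_MAX:
        return "pmTSS"   # at most 300 nt upstream, no MTU overlap
    if sense and mtu_lo <= rel <= mtu_hi:
        return "seTSS"   # inside the MTU, same orientation
    if not sense and overlaps_mtu:
        return "asTSS"   # overlaps the MTU in antisense
    return "sTSS"        # intergenic, trans-encoded sRNA candidate
```

With a plus-strand gene at 1000 to 1999 and 100-nt contigs, a sense TSS at 960 is classified as mTSS, one at 800 as pmTSS, and one at 500 as sTSS.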

Definition of trans-encoded sRNAs and cis-encoded
antisense RNAs

After categorizing the TSS, we merged overlapping TSS
contigs of sTSS and asTSS. The resulting trans-encoded
sRNA and cis-encoded antisense RNA transcripts are named
according to the following conventions: SMx_sRNA_n and
SMx_asRNA_n, respectively.

Identification of new sRNAs from 454 RNAseq data 

Schlüter et al. [17] used a 454 RNAseq approach to
identify non-coding RNAs shorter than 350 nt. In that
study, 454 contigs were categorized as follows: trans-encoded
sRNA, cis-encoded mRNA leader, sense sRNA,
and cis-encoded antisense RNA. For this study, we
reanalyzed the 454 RNAseq data using less stringent parameters
to identify additional sRNAs. We defined a 454
contig as a region with continuous coverage of at least 5
reads and a length between 50 and 350 nt. From this initial
set, we chose only those 454 contigs whose 5' end
matched a TSS from this study. Schlüter et al. [17]
adopted a naming convention of SmelXnnn for their
sRNAs; to distinguish our newly identified
sRNAs, we inserted an R (SmelXRnnn), indicating that
relaxed parameters were applied.
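
The relaxed contig definition can be illustrated with a per-position coverage vector. This Python sketch assumes that "continuous coverage of at least 5 reads" means a read depth of at least 5 at every position of the contig, and it treats a single plus-strand sequence with 0-based indices so that the 5' end of a contig is its first index. The function names are hypothetical.

```python
def find_contigs(coverage, min_cov=5, min_len=50, max_len=350):
    """Return (start, end) pairs (0-based, inclusive) of runs whose depth is
    at least min_cov and whose length lies between min_len and max_len."""
    contigs, start = [], None
    for i, depth in enumerate(list(coverage) + [0]):  # trailing 0 closes an open run
        if depth >= min_cov:
            if start is None:
                start = i
        elif start is not None:
            if min_len <= i - start <= max_len:
                contigs.append((start, i - 1))
            start = None
    return contigs


def anchored_at_tss(contigs, tss_positions):
    """Keep only contigs whose 5' end (start index) matches a known TSS."""
    tss = set(tss_positions)
    return [c for c in contigs if c[0] in tss]
```

Runs that are too short (20 nt) or too long (400 nt) are dropped, and a contig is retained only if its first position is a TSS.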

Comparative gene prediction 

If the TSS of an organism are known, one may guide the 
gene annotation process by using the genomic coordi- 
nates of the TSS as constraints for gene prediction. Here, 
we combined our TSS data with RBS (ribosome-binding 
site) and open reading frame (ORF) predictions, and car- 
ried out protein similarity searches to improve the 
S. meliloti genome annotation by finding new genes and 
correcting positions of start- and stop-codons. Our pro- 
cedure is presented in Additional file 1: Figure S2. 

For each TSS transcript, T, we performed the following 
steps: First, the set of putative ORFs that overlap with T 
were identified. Then, we distinguished between two 
kinds of ORFs, depending on the translation initiation 
mechanism: leaderless ORFs and Shine-Dalgarno (SD) 
initiation ORFs (sdORFs). While a leaderless ORF pos- 
sesses an AUG start codon and its 5' end coincides with 
the TSS, sdORFs are preceded by a leader sequence, with 
the TSS as the 5' end. Since the leader sequences contain 
an RBS, we set the minimum length threshold to 14 nt. 
In the next step, all ORF sequences were translated and 
matched against the NCBI non-redundant (nr) protein 
database using BLASTP. BLASTP hits with an E-value <
10⁻⁵ were selected if the BLASTP alignment covered the
query sequence completely and if the matching protein
and the query were of similar length. For sdORFs we used
free_align to predict RBS on the leader sequences. 
free_align identifies an RBS as the energetically most favorable
hybridization site of a target leader sequence with
the 3' tail of a 16S rRNA by applying free energy calculations.
Following Starmer et al. [97], we set the energy cutoff
for an RBS to -3.4535 kcal/mol. Subsequently, we
classified sdORFs with RBSs and leaderless ORFs as either 
new ORFs or reannotated ORFs. New ORFs did not over- 
lap more than the start or stop codon of existing anno- 
tated ORFs. If an ORF O shared either the 5' or the 3' end, 
denoted by a start or stop codon, respectively, with a gene 
G, O was classified as a reannotated ORF. ORFs that did 
not fall into these two categories were discarded, as the 
purpose of the ORF prediction procedure was not to replace
current gene annotations with another gene
annotation having a different gene product. New ORFs
were named following the scheme SMx_ORF_n. All new
ORFs and reannotated ORFs were included in the
GenDB S. meliloti 1021 database [33].
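
Two of the decisions in this procedure lend themselves to a compact illustration: the distinction between leaderless ORFs and sdORFs, and the assignment of an ORF to the new or reannotated category. The Python sketch below is a simplification. It uses plus-strand coordinates for the initiation type, interprets "did not overlap more than the start or stop codon" as an overlap of at most 3 nt with any annotated gene on either strand, and omits the BLASTP and free_align steps. All function names are hypothetical.

```python
MIN_LEADER = 14  # nt, minimum leader length for an sdORF (from the text)


def initiation_type(tss, orf_start, start_codon):
    """Return 'leaderless', 'sd', or None (plus-strand coordinates assumed)."""
    leader = orf_start - tss
    is_aug = start_codon.upper().replace("U", "T") == "ATG"
    if leader == 0 and is_aug:
        return "leaderless"
    if leader >= MIN_LEADER:
        return "sd"  # an RBS must still be predicted on the leader (not modeled here)
    return None


def classify_orf(orf, annotated_genes, codon_len=3):
    """orf and genes are (left, right, strand) with inclusive coordinates.

    Returns 'reannotated', 'new', or None if the ORF is discarded.
    """
    left, right, strand = orf
    # Reannotated: shares one end (start or stop codon) with a gene on the same strand.
    if any(s == strand and (l == left or r == right) for l, r, s in annotated_genes):
        return "reannotated"
    # Assumption: overlap is checked against genes on both strands.
    for l, r, _ in annotated_genes:
        overlap = min(right, r) - max(left, l) + 1
        if overlap > codon_len:
            return None
    return "new"
```

An ORF that extends an annotated gene upstream while keeping its stop codon is reported as reannotated; an ORF that overlaps a gene by more than one codon without sharing an end is discarded.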

Identification of promoter motifs 
Promoter consensus determination and in silico genome- 
wide predictions 

Parameters such as spacer threshold and promoter element
size used to search for putative promoter sequences
specific for RpoD, RpoE2, RpoE1, RpoE9, RpoH1,
RpoH2, RpoH1/2, and RpoN are listed in Additional file
5: Table S22. In contrast to previous studies, all regions
upstream of TSS were included in our analyses, regardless
of their position in an IGR or a CDS. To discover promoter
sequences, we followed two different strategies,
depending on available input data: string-based or profile
searching (see below).

String-based promoter search for RpoE1 and RpoE9
promoters

We used the consensus motifs GAAT-N18-GTCT
(RpoE1; unpublished observations; [68]) and TGTCACA-N16-CGTC
(RpoE9; [69]) to perform pattern matching
on the S. meliloti 1021 genome. We scanned for putative 
promoters that had no more than a single mismatch with 
respect to the consensus, a spacer length in the range spe- 
cified in Additional file 5: Table S22, and were located up- 
stream of a TSS with minimum and maximum distance of 
5 nt and 12 nt from the end of the promoter, respectively. 
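
A string-based search of this kind can be sketched in a few lines of Python. The scan below reports every occurrence of a two-box motif with at most one mismatch in total across both boxes, for each allowed spacer length, and a second helper checks the 5 to 12 nt distance to a TSS. The spacer range used in the example (17 to 19 nt) is a placeholder, since the actual ranges are given in Table S22; the distance is counted here as the number of bases between the last promoter base and the TSS, which is an assumption. Function names are hypothetical, and only the plus strand is scanned.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scan_two_box(seq, upstream_box, downstream_box, spacers, max_mismatch=1):
    """Return (start, spacer, end) for each hit; end is the 0-based index of
    the first base after the downstream box."""
    hits = []
    w_up, w_down = len(upstream_box), len(downstream_box)
    for i in range(len(seq) - w_up + 1):
        mm_up = mismatches(seq[i:i + w_up], upstream_box)
        if mm_up > max_mismatch:
            continue
        for spacer in spacers:
            j = i + w_up + spacer
            box = seq[j:j + w_down]
            if len(box) < w_down:
                continue  # downstream box would run past the end of the sequence
            if mm_up + mismatches(box, downstream_box) <= max_mismatch:
                hits.append((i, spacer, j + w_down))
    return hits


def tss_in_window(promoter_end, tss_positions, min_dist=5, max_dist=12):
    """True if any TSS lies 5 to 12 nt downstream of the promoter end."""
    return any(min_dist <= t - promoter_end <= max_dist for t in tss_positions)
```

For example, `scan_two_box(seq, "GAAT", "GTCT", (17, 18, 19))` would search for RpoE1-like promoters, and each hit would then be kept only if `tss_in_window(end, tss_positions)` is true.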

Profile search of promoters 

Previously identified promoter sequences were used as input
for the prediction of RpoD- (n=25), RpoH1- (n=20),
RpoH2- (n=11), RpoH1/H2- (n=14), and RpoN- (n=8)
specific promoters [62,65,98]. For the RpoN promoter
search model, we excluded two that were previously 
reported [98] because one was shown to be RpoD-like and 
the other was not found in S. meliloti strain 1021. For
RpoE2 promoters, only one RpoE2-dependent TSS, that
of the rsiA1-rpoE2 operon, had been mapped prior to this
study [86]; therefore, we used a set of 35 putative promoters
predicted previously by aligning upstream regions
of a set of rpoE2-dependent genes [27]. For purposes of
our study, input promoters were designated as either "vir- 
tual", if they had been previously reported in the literature, 
but no TSS corresponding to the promoter was present in 
our RNAseq data, or as "model" if an RNAseq TSS was lo- 
cated an appropriate distance downstream (Additional file 
3: Tables S6-S11 and Additional file 5: Table S23). 

In view of the sparse promoter data, we implemented an
iterative two-phase procedure both to identify the global promoter
architecture and to refine the promoter consensus.
The main analytical steps of our promoter prediction workflow
are outlined in Additional file 1: Figure S3.

Before beginning Phase 1, we extended the set of input
promoter sequences by searching for promoter regions
identical to the -35 and -10 elements of each input
promoter, and within an appropriate distance of a TSS
(denoted "extended model" in Additional file 3: Tables
S6-S11).

In Phase 1, promoter candidates were identified by 
means of position-specific scoring matrices (PSSMs). A 
PSSM is the probabilistic representation of a promoter 
motif; it summarizes a conserved pattern by modeling 
nucleotide variability at each position of the motif. We 
conducted a profile search for seed promoter sequences,
using the initial PSSMs obtained with the input promoter
sequences. First, -35 and -10 elements were
extracted from the input promoter sequences, aligned,
and separate PSSMs constructed from the alignments
with a pseudocount of 1. After constructing PSSMs, we
performed a profile search on the S. meliloti genome for
putative promoter elements using PoSSuMsearch [99].
Then, based on the promoter structure we computed
chains of collinear -35 and -10 elements, such that a 
-35 element is located upstream of a -10 element and 
the distance between the elements complies with the 
spacer length threshold (Additional file 5: Table S22). 
Central to the promoter prediction process was the inte- 
gration of TSS and 454 contigs [17] with promoter can- 
didates. Promoter candidates were retained only if they 
were located 5 nt to 12 nt upstream of a TSS or 454 
contig. The last filter step of Phase 1 discarded all pro- 
moters that had promoter elements with P-value > 0.005 
or that were associated with seTSS (sense sRNAs) be- 
cause such predictions were more likely to be false posi- 
tives. This ensured that only promoters that were good 
representatives of their respective promoter motif were 
combined with the input promoter sequences to form a 
new set of seed promoter sequences (denoted "new from 
initial search" in Additional file 3: Tables S6-S11). 
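
The core of Phase 1 (building PSSMs with a pseudocount of 1, scanning, and chaining collinear -35 and -10 hits) can be sketched as follows. This is not a reimplementation of PoSSuMsearch: the sketch uses log-odds scores against a uniform background and a plain score threshold in place of the P-value cutoff, both of which are simplifying assumptions, and all function names are hypothetical.

```python
import math

BASES = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PSSM from aligned, equal-length sites (pseudocount per base)."""
    pssm = []
    for col in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pssm


def score(pssm, word):
    """Sum of per-position log-odds scores for a word of PSSM width."""
    return sum(column[base] for column, base in zip(pssm, word))


def scan(pssm, seq, threshold):
    """Start indices (0-based) of windows scoring at least threshold."""
    width = len(pssm)
    return [i for i in range(len(seq) - width + 1)
            if score(pssm, seq[i:i + width]) >= threshold]


def chain(hits_35, hits_10, width_35, min_spacer, max_spacer):
    """Pair each -35 hit with every -10 hit located downstream at an allowed spacer."""
    return [(i, j) for i in hits_35 for j in hits_10
            if min_spacer <= j - (i + width_35) <= max_spacer]
```

The chained pairs would then be filtered by their distance to a TSS or 454 contig, as described above, before being added to the seed set.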

In Phase 2, we started with the new set of promoter 
sequences obtained from Phase 1. All steps of Phase 1 
were then repeated with this new set and the same pa- 
rameters, except for the last filter step. Since the aim of 
Phase 2 was to obtain a comprehensive set of promoters 
and to elucidate promoter architecture, we lowered the 
restrictions by allowing all classes of TSS to be associ- 
ated with promoters and increased the P-value threshold 
to 0.02 (Additional file 1: Figure S3). The additional pro- 
moters identified in Phase 2 are denoted "second search" 
in Additional file 3: Tables S6-S11. 

Identification of CtrA binding motifs 

To identify putative CtrA binding sites upstream of S.
meliloti protein-coding genes, we used a matrix-based
search method, RSA-tools-matrix-scan [100]. The PSSM
used for searching included sequences of five experimentally
verified CtrA binding sites upstream of ctrA
(consensus = TAA-N7-TTAAC; [38]). Although one of these
binding sites has 6 nt spacing, we searched only for motifs
with 7 nt spacing because CtrA binding sites with spacing
≠ 7 nt showed low affinity binding in C. crescentus [101].
We searched the direct strand of DNA upstream of all
ORFs (pseudo-frequencies = 0.01) and retained those hits
that did not overlap ORFs and that had a positive weight
score and P-value < 1 × 10⁻⁴.
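
For orientation, the half-site arrangement of the CtrA consensus can be located with a regular expression. The Python sketch below finds exact matches to TAA-N7-TTAAC on the direct strand only; it is a much cruder stand-in for the matrix-based search described above, which scores degenerate sites. The function name is hypothetical.

```python
import re

# Lookahead so that overlapping occurrences are also reported.
CTRA_CONSENSUS = re.compile(r"(?=(TAA[ACGT]{7}TTAAC))")


def ctra_consensus_sites(upstream_seq):
    """Return (0-based start, matched sequence) for exact consensus matches."""
    return [(m.start(), m.group(1))
            for m in CTRA_CONSENSUS.finditer(upstream_seq.upper())]
```

A site with a 6-nt spacer is not reported, in line with the restriction to 7-nt spacing.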

de novo promoter motif discovery 

For de novo promoter motif discovery we compiled a set
of 482 sequences representing 40 nt upstream of mTSS
and lmTSS that were not assigned to a σ factor and had
a coverage of at least 300 reads (10-fold the coverage
threshold used to identify TSS). The sequences were
used as input for MEME (Multiple EM for Motif Elicitation)
version 4.8.1 [79], with the following parameters:
-dna -mod anr -minw 18 -maxw 40 -minsites 10. Motifs
with an E-value of <1 were examined further.
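
Assembling the MEME input requires strand-aware extraction of the 40 nt upstream of each selected TSS. The Python sketch below shows one way to do this; the record keys (`pos`, `strand`, `cls`, `coverage`, `sigma`) are hypothetical, positions are 1-based, and the circularity of the replicons is ignored, so TSS within 40 nt of a sequence end are not handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def upstream_region(genome, tss, strand, length=40):
    """Return the bases immediately upstream of a 1-based TSS position,
    written 5' to 3' on the strand of the transcript."""
    if strand == "+":
        return genome[tss - 1 - length:tss - 1]
    # Minus strand: upstream lies to the right; reverse-complement it.
    return genome[tss:tss + length].translate(_COMPLEMENT)[::-1]


def meme_input(genome, tss_records, min_coverage=300):
    """Select mTSS and lmTSS without an assigned sigma factor and with
    sufficient coverage, and return their upstream sequences."""
    return [upstream_region(genome, t["pos"], t["strand"])
            for t in tss_records
            if t["cls"] in ("mTSS", "lmTSS")
            and t["sigma"] is None
            and t["coverage"] >= min_coverage]
```

The resulting sequences could be written to a FASTA file and passed to MEME with the parameters listed above.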

Affymetrix GeneChip analysis and 5' RACE mapping

To identify genes whose expression was rpoE2-dependent,
we used Affymetrix GeneChip analysis to
compare two strains: CL150 pCAP11 and CL150
pF1084. CL150 is an S. meliloti 1021 strain in which two
point mutations were corrected by gene replacement:
ecfR1, encoding the RpoE1 anti-σ factor, and pstC, encoding
a subunit of the high affinity phosphate transporter
[37]. To repair ecfR1, the corresponding genomic
sequence of strain Rm2011 [67] was amplified using 
primers 5'-CTCGAGCTGCGGAGACCGAAATGACC 
and 5'-CTCGAGTGCCTGGGAGAGCTATCTTG and 
cloned into the SmaI restriction site of pJQ200 [102] to
create pCL102P. pCL102P was introduced into the genome
of strain 1021 by conjugation. Transconjugants
were subjected to sucrose selection on M9 agar plates
with 10% sucrose, which allowed for selection of the
1021 ecfR1+ strain, designated CL101. To repair pstC, the
corresponding genomic sequence of strain RmP110 [37]
was amplified using primers 5'-ATACTAGTTGTTCTTCTACGTTAAGAAGGCTC
and 5'-AACTCGAGGATGAGCATGATGCCCATGACGAT and cloned into the
SpeI and XhoI restriction sites of pJQ200 [102] to create
pMB772. pMB772 was introduced into the genome of
strain CL101 by conjugation. Transconjugants were
subjected to sucrose selection on M9 agar plates with 10%
sucrose, which allowed for selection of the 1021 ecfR1+,
pstC+ strain, designated CL150. pCAP11 is a broad host
range expression vector for exogenous expression of genes
from the inducible melibiose promoter, PmelA [103]. A
plasmid overexpressing rpoE2 was constructed by
amplifying the rpoE2-coding region from strain 1021 using
primers 5'-GAGGATACGCGATGTCATCCGAAAACCAAG
and 5'-AACACCTAGGTCACGAAACGAAGGCAC
for 20 cycles, then adding the primer 5'-GGGGCCTAGGACAGGAGGATACGCGAT
and amplifying for an additional 20 cycles. The resulting DNA was gel purified,
digested with AvrII, and cloned into the AvrII site of
pCAP11. Triplicate cultures of CL150 pCAP11 (vector
only control) and CL150 pF1084 were inoculated to an
OD600 of 0.05 in M9 glycerol medium and allowed to
grow to midexponential phase (OD600 = 0.50-0.55), at
which time melibiose was added to a final concentration
of 0.4%. After 30 minutes, cells were harvested as previously
described [62]. RNA purification, cDNA synthesis,
labeling, hybridization to custom Affymetrix GeneChips, 
and data analysis were performed as previously described 
[62]. The custom dual-genome GeneChip contains probe 
sets corresponding to coding sequences from the original 
S. meliloti genome annotation reported in reference [20], 
intergenic regions (IGRs) of > 150 nt, and to ~10,000 host
plant (Medicago truncatula) expressed sequence tags. In 
designing the IGR probe sets, large IGRs were equally 
subdivided to be < 375 bp and tiling of oligo-(25)-mers 
was evenly spaced on both strands. The Affymetrix data 
described in this study are available in the National Center
for Biotechnology Information's Gene Expression Omnibus
[104] through the GEO Series accession number
GSE40391.

5' RACE (5' rapid amplification of cDNA ends) was
used to determine transcription start sites for a subset of
rpoE2-dependent genes, as well as for two regulatory
genes whose expression was not rpoE2-dependent (rpoH1
and rsiA2). Primers used for 5' RACE mapping are provided
in Additional file 5: Table S21. Altogether, we determined
transcription start sites for 26 genes, selected
primarily to represent genes whose expression was
detected by IGR probe sets or whose expression was previously
shown to be rpoH-dependent [62].

Additional file 1: Supplemental figures.

Additional file 2: Table S1 - TSS associated with tRNAs, repeat sequences,
and transposable elements; Table S2 - TSS associated with CDS, sRNAs, and
asRNA; Table S3 - Summary of newly identified and reannotated open reading
frames; Table S4 - Comparison of experimentally validated and predicted
operons to transcription start site (TSS) data; Table S5 - Comparison of RNAseq
TSS to previously published TSS.

Additional file 3: Table S6 - RpoD promoter sequences and their
corresponding TSS; Table S7 - RpoH1 promoter sequences and their
corresponding TSS; Table S8 - RpoH2 promoter sequences and their
corresponding TSS; Table S9 - Dual (RpoH1/H2) promoter sequences and
their corresponding TSS; Table S10 - RpoN promoter sequences and their
corresponding TSS; Table S11 - RpoE2 promoter sequences and their
corresponding TSS; Table S12 - RpoE1 promoter sequences and their
corresponding TSS; Table S13 - RpoE9 promoter sequences and their
corresponding TSS.

Additional file 4: Table S14 - Correlation of transcriptional regulator
binding motifs with TSS; Table S15 - Predicted CtrA binding motifs and
their corresponding TSS; Table S16 - Promoter motifs identified by MEME
analyses; Table S17 - Predicted FeuP binding motifs compared to TSS
identified in this study and to microarray data for 14 feuP-dependent
genes identified by Griffitts et al. [82].

Additional file 5: Table S18 - Results of transcriptome analyses to
identify putative RpoE2 target genes; Table S19 - rpoE2-dependent genes
with transcriptional activity correlated with TSS and promoter predictions;
Table S20 - rpoE2-dependent intergenic regions with transcriptional
activity correlated with TSS and promoter predictions; Table S21 - 5'
RACE mapping of TSS and comparison to RNAseq TSS; Table S22 -
Promoter-specific parameters used for promoter identification upstream
of TSS; Table S23 - Number of TSS assigned to each type of promoter
sequence (virtual or model) used in the initial promoter search, by TSS
category.